Memory allocation

2020-04-17 Thread Pat Ferrel
I have used Spark for several years and realize from recent chatter on this list that I don’t really understand how it uses memory. Specifically: are spark.executor.memory and spark.driver.memory taken from the JVM heap? When does Spark take memory from the JVM heap, and when does it come from off-heap?
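
For context, a minimal sketch of where these settings are usually supplied from application code; the values and app name are illustrative, not from this thread. spark.driver.memory and spark.executor.memory size the driver and executor JVM heaps, while the spark.memory.offHeap.* settings carve out memory outside the heap.

    import org.apache.spark.sql.SparkSession

    // Illustrative values only. The spark.*.memory settings size the JVM heaps;
    // spark.driver.memory is only honored here if set before the driver JVM starts
    // (e.g. via spark-defaults or spark-submit), since this code already runs in it.
    val spark = SparkSession.builder()
      .appName("memory-sizing-sketch")                  // placeholder name
      .config("spark.driver.memory", "4g")              // driver heap
      .config("spark.executor.memory", "8g")            // executor heap
      .config("spark.memory.offHeap.enabled", "true")   // optional off-heap region
      .config("spark.memory.offHeap.size", "2g")        // allocated outside the JVM heap
      .getOrCreate()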

Re: IDE suitable for Spark

2020-04-07 Thread Pat Ferrel
IntelliJ Scala works well when debugging master=local. Has anyone used it for remote/cluster debugging? I’ve heard it is possible... From: Luiz Camargo Reply: Luiz Camargo Date: April 7, 2020 at 10:26:35 AM To: Dennis Suhari Cc: yeikel valdes , zahidr1...@gmail.com , user@spark.apache.org

Re: k8s orchestrating Spark service

2019-07-03 Thread Pat Ferrel
todays question. From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 5:14:05 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service > We’d like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really do

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
run our Driver and Executors considering that the Driver is part of the Server process? Maybe we are talking past each other with some mistaken assumptions (on my part perhaps). From: Pat Ferrel Reply: Pat Ferrel Date: July 1, 2019 at 4:57:20 PM To: user@spark.apache.org , Matt Cheah

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
anyone have something they like? From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 4:45:55 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service Sorry, I don’t quite follow – why use the Spark standalone cluster as an in-between layer when one can just

Re: k8s orchestrating Spark service

2019-07-01 Thread Pat Ferrel
of services including Spark. The rest work, we are asking if anyone has seen a good starting point for adding Spark as a k8s managed service. From: Matt Cheah Reply: Matt Cheah Date: July 1, 2019 at 3:26:20 PM To: Pat Ferrel , user@spark.apache.org Subject: Re: k8s orchestrating Spark service

k8s orchestrating Spark service

2019-06-30 Thread Pat Ferrel
We're trying to set up a system that includes Spark. The rest of the services have good Docker containers and Helm charts to start from. Spark, on the other hand, is proving difficult. We forked a container and have tried to create our own chart but are having several problems with this. So back to

Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster. The danger increases with the semver change, and this one is not just a build #. In other words, 2.4 is considered to be a fairly major change from 2.3. Not much else can be said. From: Nicolas Paris Reply:

Fwd: Spark Architecture, Drivers, & Executors

2019-05-17 Thread Pat Ferrel
In order to create an application that executes code on Spark we have a long-lived process. It periodically runs jobs programmatically on a Spark cluster, meaning it does not use spark-submit. The jobs it executes have varying requirements for memory, so we want to have the Spark Driver run in the
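
A hedged sketch of the pattern described above, launching a job from a long-lived process with the context-creation API instead of spark-submit; the master URL, app name, and memory value are placeholders.

    import org.apache.spark.sql.SparkSession

    def runJob(executorMem: String): Unit = {
      // In this setup the driver runs inside the calling server process.
      val spark = SparkSession.builder()
        .appName("server-launched-job")                  // placeholder
        .master("spark://master:7077")                   // placeholder master URL
        .config("spark.executor.memory", executorMem)    // varies per job, per the post
        .getOrCreate()
      try {
        spark.range(0, 1000000L).selectExpr("sum(id)").show()   // stand-in for the real work
      } finally {
        spark.stop()   // release executors between jobs
      }
    }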

Re: Spark structured streaming watermarks on nested attributes

2019-05-06 Thread Pat Ferrel
Streams have no end until watermarked or closed. Joins need bounded datasets, et voila. Something tells me you should consider the streaming nature of your data and whether your joins need to use increments/snippets of infinite streams or to re-join the entire contents of the streams accumulated
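
A minimal sketch of the bounding idea above for a stream-stream join: watermark both sides and constrain the join with a time range so old state can be dropped. Source and column names are invented; if the event time lives in a nested struct, it would typically be projected to a top-level column before the watermark is applied.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().appName("watermark-join-sketch").master("local[2]").getOrCreate()

    val impressions = spark.readStream.format("rate").load()
      .selectExpr("value as impressionAdId", "timestamp as impressionTime")
      .withWatermark("impressionTime", "2 hours")

    val clicks = spark.readStream.format("rate").load()
      .selectExpr("value as clickAdId", "timestamp as clickTime")
      .withWatermark("clickTime", "3 hours")

    // The time-range condition plus the watermarks bound how much of each
    // "infinite" stream has to be retained as join state.
    val joined = impressions.join(clicks, expr(
      "clickAdId = impressionAdId AND " +
      "clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))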

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Pat Ferrel
@Riccardo Spark does not do the DL learning part of the pipeline (afaik) so it is limited to data ingestion and transforms (ETL). It therefore is optional and other ETL options might be better for you. Most of the technologies @Gourav mentions have their own scaling based on their own compute

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
Thanks, are you referring to https://github.com/spark-jobserver/spark-jobserver or the undocumented REST job server included in Spark? From: Jason Nerothin Reply: Jason Nerothin Date: March 28, 2019 at 2:53:05 PM To: Pat Ferrel Cc: Felix Cheung , Marcelo Vanzin , user Subject: Re

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
;-) Great idea. Can you suggest a project? Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only launches trivially in test apps since most uses are as a lib. From: Felix Cheung Reply: Felix Cheung Date: March 28, 2019 at 9:42:31 AM To: Pat Ferrel , Marcelo Vanzin Cc

Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
e mode you might be able to use this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala Lastly, you can always check where Spark processes run by executing ps on the machine, i.e. `ps aux | grep java`. Best, Jianneng *From:* Pat Ferrel *Dat

Re: spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
Reply: Marcelo Vanzin Date: March 26, 2019 at 1:59:36 PM To: Pat Ferrel Cc: user Subject: Re: spark.submit.deployMode: cluster If you're not using spark-submit, then that option does nothing. If by "context creation API" you mean "new SparkContext()" or an equ

spark.submit.deployMode: cluster

2019-03-26 Thread Pat Ferrel
I have a server that starts a Spark job using the context creation API. It DOES NOT use spark-submit. I set spark.submit.deployMode = “cluster”. In the GUI I see 2 workers with 2 executors. The link for the running application “name” goes back to my server, the machine that launched the job. This is
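
For reference, a sketch of the setup described here (master URL and app name are placeholders). As the replies later in this thread point out, spark.submit.deployMode is read by spark-submit, so setting it from an already-running JVM does not move the driver out of that process.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("server-launched-app")            // placeholder
      .setMaster("spark://master:7077")             // placeholder
      .set("spark.submit.deployMode", "cluster")    // effectively ignored without spark-submit
    val sc = new SparkContext(conf)                 // the driver stays in this JVM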

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
:07 AM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. All the docs I see appear to always describe needing to use spark-submit for cluster mode -- it's not even compatible

Re: Where does the Driver run?

2019-03-25 Thread Pat Ferrel
only guessing at that). Further; if we don’t use spark-submit we can’t use deployMode = cluster ??? From: Akhil Das Reply: Akhil Das Date: March 24, 2019 at 7:45:07 PM To: Pat Ferrel Cc: user Subject: Re: Where does the Driver run? There's also a driver ui (usually available on port 4040

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
60g BTW I would expect this to create one Executor, one Driver, and the Master on 2 Workers. From: Andrew Melo Reply: Andrew Melo Date: March 24, 2019 at 12:46:35 PM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
60g From: Andrew Melo Reply: Andrew Melo Date: March 24, 2019 at 12:46:35 PM To: Pat Ferrel Cc: Akhil Das , user Subject: Re: Where does the Driver run? Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel wrote: > Thanks, I have seen this many times in my research. Paraphrasi

Re: Where does the Driver run?

2019-03-24 Thread Pat Ferrel
: Akhil Das Date: March 23, 2019 at 9:26:50 PM To: Pat Ferrel Cc: user Subject: Re: Where does the Driver run? If you are starting your "my-app" on your local machine, that's where the driver is running. [image: image.png] Hope this helps. <https://spark.apache.org/docs/l

Where does the Driver run?

2019-03-23 Thread Pat Ferrel
I have researched this for a significant amount of time and find answers that seem to be for a slightly different question than mine. The Spark 2.3.3 cluster is running fine. I see the GUI on “http://master-address:8080”, there are 2 idle workers, as configured. I have a Scala application that

Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
Executor.java:163) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurren

Spark with Kubernetes connecting to pod id, not address

2019-02-12 Thread Pat Ferrel
From: Pat Ferrel Reply: Pat Ferrel Date: February 12, 2019 at 5:40:41 PM To: user@spark.apache.org Subject:  Spark with Kubernetes connecting to pod id, not address We have a k8s deployment of several services including Apache Spark. All services seem to be operational. Our application

Give a task more resources

2017-01-11 Thread Pat Ferrel
I have a task that would benefit from more cores but the standalone scheduler launches it when only a subset are available. I’d rather use all cluster cores on this task. Is there a way to tell the scheduler to finish everything before allocating resources to a task? Like "finish everything
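
Not from the thread, but one related knob set: these scheduler settings make an application wait for its requested executors to register before scheduling begins, instead of launching on whatever subset happens to be free. Values are illustrative.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")        // wait for all requested resources
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s") // ...but give up after 2 minutes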

Memory allocation for Broadcast values

2015-12-20 Thread Pat Ferrel
I have a large Map that is assembled in the driver and broadcast to each node. My question is how best to allocate memory for this. The Driver has to have enough memory for the Maps, but only one copy is serialized to each node. What type of memory should I size to match the Maps? Is the
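
A minimal sketch of the pattern being asked about (Spark 1.x-era API, still valid): the Map is assembled in the driver, so it must fit in spark.driver.memory, and each executor JVM then holds one deserialized copy inside spark.executor.memory, not one per task. Names and sizes are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))

    val bigMap: Map[String, String] = (1 to 100000).map(i => s"key$i" -> s"value$i").toMap
    val bcMap = sc.broadcast(bigMap)          // serialized once from the driver

    val hits = sc.parallelize(Seq("key1", "key2", "missing"))
      .filter(k => bcMap.value.contains(k))   // tasks read the executor-local copy
      .count()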

RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I’m getting *huge* execution times on a moderate sized dataset during the RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation. I’m using Spark 1.5.1 and from researching I would expect this calculation to be linearly proportional to the number of partitions as a
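
A hedged timing sketch (assuming an existing SparkContext sc): isEmpty() is essentially "no partitions, or take(1) returns nothing", so comparing it against take(1) on a cached RDD helps show whether the cost is in the check itself or in recomputing the upstream lineage.

    val rdd = sc.parallelize(1 to 1000000, numSlices = 100).map(_ * 2)
    rdd.cache().count()                        // materialize once so neither check re-runs the lineage

    var t = System.nanoTime
    println(s"isEmpty=${rdd.isEmpty()} took ${(System.nanoTime - t) / 1e6} ms")

    t = System.nanoTime
    println(s"take(1)=${rdd.take(1).toSeq} took ${(System.nanoTime - t) / 1e6} ms")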

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
wrote: Are you sure it's isEmpty? and not an upstream stage? isEmpty is definitely the action here. It doesn't make sense that take(1) is so much faster since it's the "same thing". On Wed, Dec 9, 2015 at 5:11 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > Sure, I thought th

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
<so...@cloudera.com> wrote: > > On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote: >> The “Any” is required by the code it is being passed to, which is the >> Elasticsearch Spark index writing code. The values are actually RDD[(String, >> Map[String,

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
river is trivial. Or maybe adapt some version of the implementation of take() to be an optimized, smarter isEmpty(). Neither seemed worth the overhead at the time, but this could be a case against that, if it turns out somehow to be serialization time. On Wed, Dec 9, 2015 at 5:55 PM, Pat Ferr

Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
<so...@cloudera.com> wrote: It should at best collect 1 item to the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but, do you have any more info? how slow is slow and what is slow? On Wed, Dec 9, 2015 at 4:41 PM, Pat

rdd.saveAsSequenceFile(path)

2015-06-27 Thread Pat Ferrel
Our project is having a hard time following what we are supposed to do to migrate this function from Spark 1.2 to 1.3.

    /**
     * Dump matrix as computed Mahout's DRM into specified (HD)FS path
     * @param path
     */
    def dfsWrite(path: String) = {
      val ktag = implicitly[ClassTag[K]]
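
Not the Mahout DRM code above, but a generic sketch of the underlying call for comparison: saveAsSequenceFile is available on pair RDDs whose key and value types have Writable conversions (in Spark 1.2 this required `import org.apache.spark.SparkContext._`; from 1.3 on the implicits come in via the RDD object). Path and values are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("seqfile-sketch").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    pairs.saveAsSequenceFile("hdfs:///tmp/seqfile-sketch")   // placeholder path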

Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Using Spark streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-x’ files. When reading the nano-batch files and doing a distributed calculation my tasks run only on the machine where the job was launched. I’m launching in “yarn-client” mode. The
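
For the record, a hedged sketch of spreading such small inputs across the cluster by repartitioning after the read (assuming an existing SparkContext sc; paths and counts are placeholders). As the rest of this thread notes, a very small total data size may simply not need more than one worker.

    val lines  = sc.textFile("hdfs:///data/nano-batches/part-*")   // thousands of ~4k files
    val spread = lines.repartition(16)                             // enough partitions for every worker
    spread.map(_.split("\t")).count()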

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
(rowIDColumn) - tokens(columnIDPosition) } interactions.cache() On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele gangele...@gmail.com wrote: Will you be able to paste code here? On 23 April 2015 at 22:21, Pat Ferrel p...@occamsmachete.com mailto:p...@occamsmachete.com wrote: Using Spark

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Argh, I looked and there really isn’t that much data yet. There will be thousands but starting small. I bet this is just a total data size not requiring all workers thing—sorry, nevermind. On Apr 23, 2015, at 10:30 AM, Pat Ferrel p...@occamsmachete.com wrote: They are in HDFS so available

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
, 2015, at 10:23 AM, Sean Owen so...@cloudera.com wrote: Where are the file splits? meaning is it possible they were also (only) available on one node and that was also your driver? On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel p...@occamsmachete.com wrote: Sure var columns = mc.textFile(source

Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
They are in HDFS so available on all workers On Apr 23, 2015, at 10:29 AM, Pat Ferrel p...@occamsmachete.com wrote: Physically? Not sure, they were written using the nano-batch rdds in a streaming job that is in a separate driver. The job is a Kafka consumer. Would that effect all derived

TaskResultLost

2015-04-14 Thread Pat Ferrel
Running on Spark 1.1.1, Hadoop 2.4 with Yarn, on an AWS dedicated cluster (non-EMR). Is this in our code or config? I’ve never run into a TaskResultLost, not sure what can cause that. TaskResultLost (result lost from block manager) nivea.m https://gd-a.slack.com/team/nivea.m[11:01 AM] collect at

Re: Need Advice about reading lots of text files

2015-03-17 Thread Pat Ferrel
/15e72f7bc22337cf6653 Michael On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com mailto:p...@occamsmachete.com wrote: It’s a long story but there are many dirs with smallish part- files in them so we create a list of the individual files as input to sparkContext.textFile

Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
... :: Nil).flatMap(new ReadLinesSafe(_)) You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 https://gist.github.com/marmbrus/15e72f7bc22337cf6653 Michael On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel p...@occamsmachete.com mailto:p

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
many tasks (its not a jvm per task). On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel p...@occamsmachete.com mailto:p...@occamsmachete.com wrote: Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com mailto:p

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote: We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList

Need Advice about reading lots of text files

2015-03-13 Thread Pat Ferrel
We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large (maybe 1) and is all on hdfs. The question is: what is the most performant way to read them? Should they be
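
A minimal sketch of the comma-delimited approach described above (assuming an existing SparkContext sc; paths are placeholders): textFile() accepts one comma-separated string of paths or globs, and a coalesce afterwards keeps the partition count from exploding when the individual files are small.

    val fileList = Seq(
      "hdfs:///data/dir1/part-*",
      "hdfs:///data/dir2/part-*")                 // the real list is much longer
    val rdd       = sc.textFile(fileList.mkString(","))
    val compacted = rdd.coalesce(200)             // otherwise roughly one tiny partition per file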

Re: Column Similarities using DIMSUM fails with GC overhead limit exceeded

2015-03-02 Thread Pat Ferrel
Sab, not sure what you require for the similarity metric or your use case but you can also look at spark-rowsimilarity or spark-itemsimilarity (column-wise) here http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Re: Upgrade to Spark 1.2.1 using Guava

2015-03-02 Thread Pat Ferrel
: Yes. I ran into this problem with mahout snapshot and spark 1.2.0 not really trying to figure out why that was a problem, since there were already too many moving parts in my app. Obviously there is a classpath issue somewhere. /Erlend On 27 Feb 2015 22:30, Pat Ferrel p...@occamsmachete.com

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-28 Thread Pat Ferrel
where the serializer is and where the user class is. At the top you said Pat that you didn't try this, but why not? On Fri, Feb 27, 2015 at 10:11 PM, Pat Ferrel p...@occamsmachete.com wrote: I’ll try to find a Jira for it. I hope a fix is in 1.3 On Feb 27, 2015, at 1:59 PM, Pat Ferrel p

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
25, 2015 at 5:17 PM, Pat Ferrel p...@occamsmachete.com wrote: The root Spark pom has guava set at a certain version number. It’s very hard to read the shading xml. Someone suggested that I try using userClassPathFirst but that sounds too heavy handed since I don’t really care which version

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
erl...@hamnaberg.net wrote: Hi. I have had a simliar issue. I had to pull the JavaSerializer source into my own project, just so I got the classloading of this class under control. This must be a class loader issue with spark. -E On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote: I don’t use spark-submit I have a standalone app. So I guess you want me to add that key/value to the conf in my code and make sure it exists on workers. On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I’ll try to find a Jira for it. I hope a fix is in 1.3 On Feb 27, 2015, at 1:59 PM, Pat Ferrel p...@occamsmachete.com wrote: Thanks! that worked. On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote: I don’t use spark-submit I have a standalone app. So I guess you want me

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-27 Thread Pat Ferrel
I changed in the spark master conf, which is also the only worker. I added a path to the jar that has guava in it. Still can’t find the class. Trying Erland’s idea next. On Feb 27, 2015, at 1:35 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel p

upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With Spark 1.1.0 I simply registered the class and its serializer with kryo like this:
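
A simplified sketch of that Spark 1.x-style registration (the registrator name is a placeholder, and the custom serializer from the original code is omitted here):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[com.google.common.collect.HashBiMap[_, _]])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)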

Re: Row similarities

2015-01-18 Thread Pat Ferrel
algebra first? On Jan 17, 2015, at 6:27 PM, Reza Zadeh r...@databricks.com wrote: We're focused on providing block matrices, which makes transposition simple: https://issues.apache.org/jira/browse/SPARK-3434 https://issues.apache.org/jira/browse/SPARK-3434 On Sat, Jan 17, 2015 at 3:25 PM, Pat Ferrel

Re: Row similarities

2015-01-17 Thread Pat Ferrel
these together. On Jan 17, 2015, at 9:37 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Excellent, thanks Pat. On Jan 17, 2015, at 9:27 AM, Pat Ferrel p...@occamsmachete.com mailto:p...@occamsmachete.com wrote: Mahout’s Spark implementation of rowsimilarity is in the Scala

Re: Row similarities

2015-01-17 Thread Pat Ferrel
rowSimilarities. On Sat, Jan 17, 2015 at 10:45 AM, Pat Ferrel p...@occamsmachete.com mailto:p...@occamsmachete.com wrote: BTW it looks like row and column similarities (cosine based) are coming to MLlib through DIMSUM. Andrew said rowSimilarity doesn’t seem to be in the master yet. Does anyone know

Re: Row similarities

2015-01-17 Thread Pat Ferrel
Mahout’s Spark implementation of rowsimilarity is in the Scala SimilarityAnalysis class. It actually does either row or column similarity but only supports LLR at present. It does [AA’] for columns or [A’A] for rows first then calculates the distance (LLR) for non-zero elements. This is a major

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Pat Ferrel
Actually the spark-itemsimilarity job and related code in the Spark module of Mahout creates all-pairs similarity too. It’s designed to use with a search engine, which provides the query part of the recommender. Integrate the two and you have a near realtime scalable item-based/cooccurrence

Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of pathnames to treat as one input. The files are part-x files and they are fairly small. The job seems to take a long time to complete considering the size of the total data (150m) and only runs on the master machine.
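
Two hedged options for many small part-files (assuming an existing SparkContext sc; paths and counts are placeholders): read them as (path, content) pairs, or coalesce after an ordinary textFile read so the job is not dominated by per-file overhead.

    val byFile = sc.wholeTextFiles("hdfs:///data/output/*/part-*", minPartitions = 32)  // (path, content) pairs
    // ...or...
    val lines  = sc.textFile("hdfs:///data/output/*/part-*").coalesce(32)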

Cores on Master

2014-11-18 Thread Pat Ferrel
I see the default and max cores settings but these seem to control total cores per cluster. My cobbled together home cluster needs the Master to not use all its cores or it may lock up (it does other things). Is there a way to control max cores used for a particular cluster machine in

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores n' to the master? No config/env way? On Nov 18, 2014, at 3:14 PM, Pat Ferrel p...@occamsmachete.com wrote: I see the default and max cores settings but these seem to control total cores

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote: Looks like I can do this by not using start-all.sh but starting each worker separately passing in a '--cores

Re: Cores on Master

2014-11-18 Thread Pat Ferrel
OK, hacking start-slave.sh did it. On Nov 18, 2014, at 4:12 PM, Pat Ferrel p...@occamsmachete.com wrote: This seems to work only on a ‘worker’ not the master? So I’m back to having no way to control cores on the master? On Nov 18, 2014, at 3:24 PM, Pat Ferrel p...@occamsmachete.com wrote
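
For anyone hitting this later: a per-machine conf/spark-env.sh entry is another way to cap the worker that shares a box with the master, without editing start-slave.sh (the value is illustrative):

    # conf/spark-env.sh on the master machine only
    SPARK_WORKER_CORES=4    # cores this host's worker may offer to applications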

Re: Class not found

2014-10-21 Thread Pat Ferrel
maven cache is laid out differently but it does work on Linux and BSD/mac. Still looks like a hack to me. On Oct 21, 2014, at 1:28 PM, Pat Ferrel p...@occamsmachete.com wrote: Doesn’t this seem like a dangerous error prone hack? It will build different bits on different machines. It doesn’t

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Pat Ferrel
Trying to upgrade from Spark 1.0.1 to 1.1.0. Can’t imagine the upgrade is the problem but anyway... I get a NoClassDefFoundError for RandomGenerator when running a driver from the CLI. But only when using a named master, even a standalone master. If I run using master = local[4] the job

local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
I’ve read several discussions of the error here and so have wiped all cluster machines and copied the master’s spark build to the rest of the cluster. I’ve built my job on the master using the correct Spark version as a dependency and even built that version of Spark. I still get the

Re: local class incompatible: stream classdesc serialVersionUID

2014-10-16 Thread Pat Ferrel
the same version of Apache Spark on each node of the cluster? And I am not only asking about current project version (1.0.0, 1.1.0 etc.) but also about package type (hadoop 1.x, hadoop 2.x). On Fri, Oct 17, 2014 at 12:35 AM, Pat Ferrel p...@occamsmachete.com wrote: I’ve read several discussions

Running new code on a Spark Cluster

2014-06-26 Thread Pat Ferrel
I’ve created a CLI driver for a Spark version of a Mahout job called item similarity with several tests that all work fine on local[4] Spark standalone. The code even reads and writes to clustered HDFS. But switching to clustered Spark has a problem that seems tied to a broadcast and/or

Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote: Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI. Since Spark supports several FS schemes I’m unclear about how much to assume about using the hadoop file systems APIs and conventions. Concretely if I pass a pattern

Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
, concatenate their paths like that and pass the single string to textFile(). Nick On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex

Read from list of files in parallel

2014-04-28 Thread Pat Ferrel
Warning noob question: The sc.textFile(URI) method seems to support reading from files in parallel but you have to supply some wildcard URI, which greatly limits how the storage is structured. Is there a simple way to pass in a URI list or is it an exercise left for the student?
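
A hedged sketch of the "URI list" idea (assuming an existing SparkContext sc; the root path and filename filter are placeholders): walk the tree with the Hadoop FileSystem API, collect the matching paths, and hand textFile() one comma-separated string.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val matching = fs.listStatus(new Path("hdfs:///data/root"))
      .filter(_.isDirectory)
      .flatMap(dir => fs.listStatus(dir.getPath))
      .map(_.getPath.toString)
      .filter(_.matches(".*part-.*"))

    val rdd = sc.textFile(matching.mkString(","))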