Re: too many small files and tasks

2014-12-19 Thread bethesda
I recently had the same problem. I'm not an expert, but I suggest concatenating your files into a smaller number of larger files, e.g. in Linux: cat files > a_larger_file. This helped greatly. Others better qualified will likely weigh in later, but that's something to get you...
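A Spark-side alternative to shell cat (my sketch, not from the post): read the small files once and write them back out as a handful of larger part files. The paths and the target count of 8 are illustrative:

  val all = sc.textFile("hdfs:///small-files/*")
  all.coalesce(8)                          // illustrative: aim for ~8 output files
     .saveAsTextFile("hdfs:///combined")   // writes part-00000 .. part-00007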

Re: reading files recursively using spark

2014-12-19 Thread bethesda
On HDFS I created:

  /one/one.txt       # contains the text one
  /one/two/two.txt   # contains the text two

Then:

  val data = sc.textFile("/one/*")
  data.collect

This returned: Array(one, two). So the above path designation appears to recurse for you automatically.
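A minimal sketch of why this works, assuming the same two HDFS paths and an existing SparkContext sc: the glob matches both the file /one/one.txt and the directory /one/two, and textFile then reads the files directly inside a matched directory.

  val data = sc.textFile("/one/*")    // matches /one/one.txt and the dir /one/two
  data.collect().foreach(println)     // prints: one, two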

Fetch Failure

2014-12-19 Thread bethesda
I have a job that runs fine on relatively small input datasets but then reaches a threshold where I begin consistently getting Fetch failure as the Failure Reason, late in the job, during a saveAsTextFile() operation. The first error we see on the Details for Stage page is ExecutorLostFailure...
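Not from the thread, but a common first response to this symptom: ExecutorLostFailure often means an executor died (frequently from memory pressure), and the fetch failures are downstream of that loss. A hedged sketch of settings one might try; the values are illustrative, not a known fix for this job:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .set("spark.executor.memory", "6g")         // more headroom per executor
    .set("spark.shuffle.io.maxRetries", "10")   // tolerate transient fetch failures
  val sc = new SparkContext(conf)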

Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread bethesda
We have a very large RDD, and I need to create a new RDD whose values are derived from each record of the original RDD, retaining only the few new records that meet a given criterion. I want to avoid creating a second large RDD and then filtering it, since I believe this could tax system resources...
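A minimal sketch of doing the derivation and the filter in a single pass with flatMap; derive and keep are hypothetical stand-ins for the real logic:

  // hypothetical: derive a value from each record, keep only qualifying ones
  def derive(rec: String): Int = rec.length
  def keep(v: Int): Boolean = v > 100

  val small = bigRdd.flatMap { rec =>
    val d = derive(rec)
    if (keep(d)) Some(d) else None   // Some = keep, None = drop, in one pass
  }

(Worth noting: RDD transformations are lazy, so even map-then-filter would not materialize the intermediate RDD; flatMap just expresses the intent in one step.)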

Why so many tasks?

2014-12-16 Thread bethesda
Our job is creating what appears to be an inordinate number of very small tasks, which blow out our OS inode and open-file limits. Rather than continually raising those limits, we are trying to understand whether our real problem is that too many tasks are running, perhaps because we are...
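For context (my addition, not from the thread): the task count tracks the partition count, and sc.textFile creates roughly one partition per HDFS block, or one per file when the files are smaller than a block. A sketch of shrinking the task count with coalesce; the path and the target of 64 are illustrative:

  val data = sc.textFile("hdfs:///corpus/*")   // ~one partition per small file
  val fewer = data.coalesce(64)                // fewer partitions => fewer tasks
  println(fewer.partitions.length)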

Re: Why so many tasks?

2014-12-16 Thread bethesda
Thank you! I had known about the small-files problem in HDFS but didn't realize that it affected sc.textFile().
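An aside of my own, not from the thread: sc.wholeTextFiles is often suggested for directories of many small files; it yields (path, contents) pairs and accepts a minPartitions hint, so the partition count is no longer tied to the file count. A sketch with illustrative values:

  // each element is (filePath, fileContents)
  val files = sc.wholeTextFiles("hdfs:///many-small-files", 16)
  println(files.partitions.length)   // typically far fewer than the file count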

Appending an incremental value to each RDD record

2014-12-16 Thread bethesda
I think this is sort of a newbie question, but I've checked the API closely and don't see an obvious answer: given an RDD, how would I create a new RDD of tuples where the first tuple value is an incrementing Int, e.g. 1, 2, 3 ..., and the second value of the tuple is the original RDD record? I'm...

Re: Appending an incremental value to each RDD record

2014-12-16 Thread bethesda
Thanks! zipWithIndex() works well. I had overlooked it because the name 'zip' is rather odd.
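A minimal sketch of the accepted approach, shaping zipWithIndex's 0-based Long index into the 1-based Int the question asked for (the .toInt assumes fewer than Int.MaxValue records):

  val numbered = rdd.zipWithIndex().map { case (rec, i) =>
    ((i + 1).toInt, rec)   // (1, firstRecord), (2, secondRecord), ...
  }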

Best practice for multi-user web controller in front of Spark

2014-11-11 Thread bethesda
We are relatively new to Spark, and so far during development we have been manually submitting single jobs for ML training, one at a time, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our HDFS corpus, which only changes...