I recently had the same problem. I'm not an expert, but I'd suggest
concatenating your files into a smaller number of larger files, e.g. on Linux:
cat file1 file2 > a_larger_file. This helped greatly.
Likely others better qualified will weigh in on this later, but that's
something to get you started.
On HDFS I created:
/one/one.txt # contains text one
/one/two/two.txt # contains text two
Then:
val data = sc.textFile("/one/*")
data.collect
This returned:
Array(one, two)
So the above path designation appears to automatically recurse for you.
I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get "Fetch failure" as
the Failure Reason, late in the job, during a saveAsTextFile() operation.
The first error we are seeing on the Details for Stage page is
ExecutorLostFailure.
We have a very large RDD, and I need to create a new RDD whose values are
derived from each record of the original RDD, retaining only the few new
records that meet a criterion. I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources.
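flatMap does the derive-and-filter in a single pass, so the large intermediate
RDD is never materialized. A minimal sketch, assuming a hypothetical derive
function that emits only the surviving values for each input record; it is shown
on a plain Scala collection, which shares the combinator with RDDs:

```scala
// Hypothetical per-record function: emit only derived values that meet
// the criterion (here, parsed ints greater than 10).
def derive(record: String): Seq[Int] =
  record.split(",").map(_.trim.toInt).filter(_ > 10).toSeq

// Plain Scala Seq for illustration; an RDD supports the same call:
//   val kept = bigRdd.flatMap(derive)   // one pass, no second large RDD
val records = Seq("1, 20, 3", "4, 5", "30, 40")
val kept = records.flatMap(derive)       // only the surviving values
```

Because flatMap can return zero, one, or many outputs per input, records that
fail the criterion simply contribute nothing, so no second large RDD is built
and then filtered.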
Our job is creating what appears to be an inordinate number of very small
tasks, which blow out our OS inode and open-file limits. Rather than
continually upping those limits, we are seeking to understand whether our real
problem is that too many tasks are running, perhaps because we are
Thank you! I had known about the small-files problem in HDFS but didn't
realize that it affected sc.textFile().
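For what it's worth, sc.textFile creates at least one partition (and so at
least one task) per input file, which is why thousands of small files turn
into thousands of tasks. A sketch of two common mitigations, assuming a
spark-shell SparkContext sc and a hypothetical corpus path:

```scala
// Assumes a running SparkContext `sc`; the path below is hypothetical.
// Each input file yields at least one partition, hence one task, so
// shrinking the partition count shrinks the task count.

// 1) coalesce after reading; it merges partitions without a shuffle
val data = sc.textFile("/corpus/small-files/*").coalesce(64)

// 2) or read each file whole as a (path, contents) pair; wholeTextFiles
//    packs many small files into far fewer partitions
val files = sc.wholeTextFiles("/corpus/small-files")
```

Concatenating the small files up front, as suggested earlier in the thread,
attacks the same problem on the HDFS side instead of the Spark side.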
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-so-many-tasks-tp20712p20717.html
Sent from the Apache Spark User List mailing list archive
I think this is sort of a newbie question, but I've checked the API closely
and don't see an obvious answer:
Given an RDD, how would I create a new RDD of Tuples where the first Tuple
value is an incremented Int, e.g. 1, 2, 3 ..., and the second value of the
Tuple is the original RDD record? I'm
Thanks! zipWithIndex() works well. I had overlooked it because the name
'zip' is rather odd.
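For anyone finding this thread later: zipWithIndex pairs each element with a
0-based index, element first, so a small map gives the (1, record),
(2, record), ... shape asked about. Shown here on a plain Scala Seq; an RDD's
zipWithIndex behaves the same way, returning pairs of (T, Long):

```scala
// zipWithIndex yields (element, index) with the index starting at 0;
// swap the pair and add 1 to put an incremented Int in the first slot.
val records = Seq("a", "b", "c")
val numbered = records.zipWithIndex.map { case (rec, i) => (i + 1, rec) }
// numbered: Seq((1, "a"), (2, "b"), (3, "c"))
```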
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Appending-an-incrental-value-to-each-RDD-record-tp20718p20722.html
Sent from the Apache Spark User List mailing list archive
We are relatively new to Spark and so far have been manually submitting
single jobs at a time for ML training, during our development process, using
spark-submit. Each job accepts a small user-submitted data set and compares
it to every data set in our HDFS corpus, which only changes