: user@spark.apache.org
Subject: Re: sparkcontext.objectFile return thousands of partitions
You have 8 files, not 8 partitions. It does not follow that they should be read
as 8 partitions: they are presumably large, so you would be stuck using
at most 8 tasks in parallel to process them.
I think you should also just be able to provide an input format that never
splits the input data. This has come up before on the list, but I couldn't
find it.
I think this should work, but I can't try it out at the moment. Can you
please try it and let us know if it works? Something along these lines,
using the old mapred API:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.mapred.SequenceFileInputFormat

    class NonSplittingSequenceFileInputFormat[K, V]
        extends SequenceFileInputFormat[K, V] {
      override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
    }
Yes, that second argument is what I was referring to, but you're right, it's a
*minimum*, oops. OK, you will want to coalesce then, indeed.
On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
> If you know that this number is too high you can request a minimum number of
> partitions when you read it.
You have 8 files, not 8 partitions. It does not follow that they should be
read as 8 partitions: they are presumably large, so you would be
stuck using at most 8 tasks in parallel to process them. The number of
partitions is determined by Hadoop input splits, and there is generally a
partition per block.
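As a rough illustration of how a handful of files can turn into thousands of
partitions (the sizes below are made up for the example):

```scala
// Hypothetical numbers: 8 part files of 16 GB each on a 64 MB block size.
val fileSize  = 16L * 1024 * 1024 * 1024              // 16 GB per file
val blockSize = 64L * 1024 * 1024                     // 64 MB HDFS blocks
val splitsPerFile = (fileSize + blockSize - 1) / blockSize  // ceiling division
val totalSplits   = 8 * splitsPerFile                 // one partition per split
println(totalSplits)  // 2048
```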
maybe each of the file parts has many blocks?
did you try RDD.coalesce to reduce the number of partitions? can
be done with or without a data shuffle.
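The two variants look like this (a sketch; assumes an existing RDD `rdd`):

```scala
// Without a shuffle: coalesce merges existing partitions in place.
// Cheap, but the merged partitions can end up unevenly sized.
val narrow = rdd.coalesce(8)

// With a shuffle: records are redistributed into evenly sized partitions,
// at the cost of moving data across the network.
val balanced = rdd.coalesce(8, shuffle = true)
```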
*Noam Barcay*
Developer // *Kenshoo*
__
Why does sc.objectFile(...) return an RDD with thousands of partitions?
I save an RDD to the file system using
rdd.saveAsObjectFile("file:///tmp/mydir")
Note that the RDD contains 7 million objects. I checked the directory
/tmp/mydir/; it contains 8 part files:
part-0 part-2 part-4 part-6