RE: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Wang, Ningjun (LNG-NPV)
> You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process

Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Imran Rashid
I think you should also just be able to provide an input format that never splits the input data. This has come up before on the list, but I couldn't find it.* I think this should work, but I can't try it out at the moment. Can you please try and let us know if it works? class

Re: sparkcontext.objectFile return thousands of partitions

2015-01-22 Thread Sean Owen
Yes, that second argument is what I was referring to, but yes it's a *minimum*, oops, right. OK, you will want to coalesce then, indeed. On Thu, Jan 22, 2015 at 6:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: > If you know that this number is too high you can request a

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Sean Owen
You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process. The number of partitions is determined by Hadoop input splits and generally makes a partition per
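
As a rough illustration of the point above, here is a plain-Python model of how Hadoop-style input splits could turn 8 large files into many partitions. It is a sketch under stated assumptions, not Spark or Hadoop code: `count_splits` is a hypothetical helper, the 4 GiB file size is invented for the example (the thread does not give sizes), the 32 MiB block size is an assumed local-filesystem default, and the split-size rule and 1.1 slop factor follow the old `FileInputFormat` logic as I understand it.

```python
def count_splits(file_sizes, block_size, min_partitions=1,
                 min_split_size=1, slop=1.1):
    """Estimate the number of input splits (hypothetical helper).

    Models split_size = max(min_split_size, min(goal_size, block_size)),
    where goal_size = total bytes / requested minimum partitions.
    """
    total = sum(file_sizes)
    goal_size = total // max(min_partitions, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    count = 0
    for size in file_sizes:
        remaining = size
        # Cut full-size splits while noticeably more than one split remains.
        while remaining / split_size > slop:
            count += 1
            remaining -= split_size
        # Whatever is left over becomes the final split of this file.
        if remaining > 0:
            count += 1
    return count


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed sizes: 8 part files of 4 GiB each, 32 MiB blocks.
files = [4 * GIB] * 8
print(count_splits(files, block_size=32 * MIB))
print(count_splits(files, block_size=32 * MIB, min_partitions=8))
```

In this model, asking for a minimum of 8 partitions does not lower the count, because the requested number only shrinks the goal size and the block size still caps each split. That matches the later remark in this thread that the second argument is a minimum, so reducing the partition count needs something like coalesce.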

Re: sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Noam Barcay
Maybe each of the file parts has many blocks? Did you try RDD.coalesce to reduce the number of partitions? It can be done with or without a data shuffle.

sparkcontext.objectFile return thousands of partitions

2015-01-21 Thread Wang, Ningjun (LNG-NPV)
Why does sc.objectFile(...) return an RDD with thousands of partitions? I saved an RDD to the file system using rdd.saveAsObjectFile("file:///tmp/mydir"). Note that the RDD contains 7 million objects. I checked the directory /tmp/mydir/; it contains 8 partitions: part-0 part-2 part-4 part-6