Sean You said
Ø If you know that this number is too high you can request a number of partitions when you read it. How to do that? Can you give a code snippet? I want to read it into 8 partitions, so I do val rdd2 = sc.objectFile[LabeledPoint]( (“file:///tmp/mydir<file:///\\tmp\mydir>”, 8) However rdd2 contains thousands of partitions instead of 8 partitions Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, January 21, 2015 2:32 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: sparkcontext.objectFile return thousands of partitions You have 8 files, not 8 partitions. It does not follow that they should be read as 8 partitions since they are presumably large and so you would be stuck using at most 8 tasks in parallel to process. The number of partitions is determined by Hadoop input splits and generally makes a partition per block of data. If you know that this number is too high you can request a number of partitions when you read it. Don't coalesce, just read the desired number from the start. On Jan 21, 2015 4:32 PM, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com<mailto:ningjun.w...@lexisnexis.com>> wrote: Why sc.objectFile(…) return a Rdd with thousands of partitions? I save a rdd to file system using rdd.saveAsObjectFile(“file:///tmp/mydir<file:///\\tmp\mydir>”) Note that the rdd contains 7 millions object. I check the directory /tmp/mydir/, it contains 8 partitions part-00000 part-00002 part-00004 part-00006 _SUCCESS part-00001 part-00003 part-00005 part-00007 I then load the rdd back using val rdd2 = sc.objectFile[LabeledPoint]( (“file:///tmp/mydir<file:///\\tmp\mydir>”, 8) I expect rdd2 to have 8 partitions. But from the master UI, I see that rdd2 has over 1000 partitions. This is very inefficient. How can I limit it to 8 partitions just like what is stored on the file system? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541