Why sc.objectFile(...) return a Rdd with thousands of partitions?
I save a rdd to file system using
rdd.saveAsObjectFile("file:///tmp/mydir")
Note that the rdd contains 7 millions object. I check the directory
/tmp/mydir/, it contains 8 partitions
part-00000 part-00002 part-00004 part-00006 _SUCCESS
part-00001 part-00003 part-00005 part-00007
I then load the rdd back using
val rdd2 = sc.objectFile[LabeledPoint]( ("file:///tmp/mydir", 8)
I expect rdd2 to have 8 partitions. But from the master UI, I see that rdd2 has
over 1000 partitions. This is very inefficient. How can I limit it to 8
partitions just like what is stored on the file system?
Regards,
Ningjun Wang
Consulting Software Engineer
LexisNexis
121 Chanlon Road
New Providence, NJ 07974-1541