Hi, dear user group: I recently tried to use the parallelize method of SparkContext to slice my original data into smaller pieces for further processing, something like this:
val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)

My original test data contains 88 objects. I know the default value of numSlices (if I don't specify sparkPartitionSize) is 10. When I set numSlices to 2 (as I have 2 slave nodes) and then run:

println("partitionedSource.count: " + partitionedSource.count)

the output is partitionedSource.count: 44. The subtasks, though, are correctly created as 2. My intention was to get two slices where each slice has 44 objects, in which case partitionedSource.count should be 2, shouldn't it? So does this result of 44 mean that I have 44 slices, or 44 objects in each slice? How could the second case happen? And what would happen if I had 89 objects? Maybe I am not using it correctly? Can somebody help me with this?

Thanks,
Xiao Bing
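For reference, here is a small plain-Scala sketch (no Spark involved; the `slices` helper is hypothetical and only mimics how a sequence might be split into `numSlices` contiguous pieces) of the two quantities I am asking about, the number of slices versus the total number of objects across slices:

```scala
// Plain-Scala sketch (no Spark) mimicking parallelize(seq, numSlices).
// The slices helper is a hypothetical stand-in for illustration only.
object CountDemo {
  // Split seq into numSlices contiguous pieces of (roughly) equal size.
  def slices[A](seq: Seq[A], numSlices: Int): Seq[Seq[A]] =
    seq.grouped(math.ceil(seq.size.toDouble / numSlices).toInt).toSeq

  def main(args: Array[String]): Unit = {
    val data  = (1 to 88).toSeq
    val parts = slices(data, 2)
    println(parts.size)            // number of slices: 2
    println(parts.map(_.size))     // objects per slice: List(44, 44)
    println(parts.map(_.size).sum) // total objects across all slices: 88
  }
}
```

With 89 objects the same split would give pieces of 45 and 44 objects, while the total across slices would still be 89.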