Hi, dear user group: I recently tried to use the parallelize method of SparkContext to slice my original data into smaller pieces for further processing, something like this:
val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)

My original test data contains 88 objects. I know the default value of numSlices (if I don't specify sparkPartitionSize) is 10. When I set numSlices to 2 (as I have 2 slave nodes) and then run:

println("partitionedSource.count: " + partitionedSource.count)

the output is partitionedSource.count: 44. The subtasks, though, are correctly created as 2. My intention was to get two slices where each slice has 44 objects, in which case partitionedSource.count should be 2, shouldn't it? So does this result of 44 mean that I have 44 slices, or 44 objects in each slice? How could the second case happen? And what would happen if I had 89 objects? Maybe I am not using it correctly? Can somebody help me with this?

Thanks,
Xiao Bing
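For reference, here is a small plain-Scala sketch (no Spark involved; the `slices` helper is hypothetical and only mimics how a sequence might be split into `numSlices` contiguous pieces) of the two quantities I am asking about, the number of slices versus the total number of objects across slices:

```scala
// Plain-Scala sketch (no Spark) mimicking parallelize(seq, numSlices).
// The slices helper is a hypothetical stand-in for illustration only.
object CountDemo {
  // Split seq into numSlices contiguous pieces of (roughly) equal size.
  def slices[A](seq: Seq[A], numSlices: Int): Seq[Seq[A]] =
    seq.grouped(math.ceil(seq.size.toDouble / numSlices).toInt).toSeq

  def main(args: Array[String]): Unit = {
    val data  = (1 to 88).toSeq
    val parts = slices(data, 2)
    println(parts.size)            // number of slices: 2
    println(parts.map(_.size))     // objects per slice: List(44, 44)
    println(parts.map(_.size).sum) // total objects across all slices: 88
  }
}
```

With 89 objects the same split would give pieces of 45 and 44 objects, while the total across slices would still be 89.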