Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested because, from what I gather, having an appropriate number
of partitions matters for performance. If I'm trying to understand how my
pipeline is performing, say for a parallelized write out to HDFS, knowing
how many partitions an RDD has seems like a good thing to check, since each
partition is processed by its own task.
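For concreteness, here's the kind of write I have in mind. This is just a
sketch: the paths and the count of 8 are made up, and I'm assuming a Spark
version that has RDD.repartition().

# in the pyspark shell, where sc is already defined
rdd = sc.textFile("hdfs:///tmp/input")  # hypothetical input path
# saveAsTextFile() writes one part-file per partition, each by its own
# task, so the partition count caps the parallelism of the write
rdd.repartition(8).saveAsTextFile("hdfs:///tmp/output")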

Is that correct?

I could not find an obvious method or property that reports how an RDD is
partitioned, so I devised the following thingy:

def f(idx, itr):
    # yield exactly one element per partition: that partition's index
    yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()  # returns 4: one index per partition

Frankly, I'm not sure this is the intended approach, but since f yields one
index per partition, counting them does seem to give me the answer I'm
looking for. Derp. :)
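Is there a built-in I'm missing? Something like the following is what I'd
hope for; I'm guessing at the method name here, so treat it as a sketch of
what I went looking for rather than an API I know exists in my version:

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.getNumPartitions()  # hoping this would return 4

On the Scala side, rdd.partitions.size looks like the equivalent check, but
I didn't spot a PySpark counterpart.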

So in summary: should I care about how finely my RDDs are partitioned, and
what is the right way to check on that?

Nick



