Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having the right number of partitions is important for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out to HDFS, knowing how many partitions an RDD has would be a good thing to check. Is that correct?

I could not find an obvious method or property to see how my RDD is partitioned. Instead, I devised the following thingy:

def f(idx, itr):
    yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the answer I'm looking for. Derp. :)

So in summary: should I care about how finely my RDDs are partitioned? And how would I check on that?

Nick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
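(To see why the mapPartitionsWithIndex trick yields the partition count, here is a plain-Python sketch that needs no SparkContext. It models an RDD's partitions as a hypothetical list of chunks: mapPartitionsWithIndex calls f once per partition with that partition's index and an iterator over its elements, so yielding only the index produces exactly one element per partition, and counting those elements gives the number of partitions.)

```python
# Hypothetical stand-in for an RDD with 4 partitions, one element each,
# mirroring sc.parallelize([1, 2, 3, 4], 4).
partitions = [[1], [2], [3], [4]]

def f(idx, itr):
    # Ignore the partition's data; emit its index exactly once.
    yield idx

# Simulate mapPartitionsWithIndex(f): apply f to each (index, iterator) pair
# and flatten the results, as Spark would across partitions.
indices = [x for idx, chunk in enumerate(partitions) for x in f(idx, iter(chunk))]

# Simulate .count(): one yielded index per partition = the partition count.
num_partitions = len(indices)
print(num_partitions)  # prints 4
```

The data itself never matters here, which is why the trick is cheap: each partition contributes a single element regardless of its size.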