It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Hey there fellow Dukes of Data,
>
> How can I tell how many partitions my RDD is split into?
>
> I'm interested in knowing because, from what I gather, having a good
> number of partitions is good for performance. If I'm looking to understand
> how my pipeline is performing, say for a parallelized write out to HDFS,
> knowing how many partitions an RDD has would be a good thing to check.
>
> Is that correct?
>
> I could not find an obvious method or property to see how my RDD is
> partitioned. Instead, I devised the following thingy:
>
>     def f(idx, itr): yield idx
>
>     rdd = sc.parallelize([1, 2, 3, 4], 4)
>     rdd.mapPartitionsWithIndex(f).count()
>
> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> answer I'm looking for. Derp. :)
>
> So in summary, should I care about how finely my RDDs are partitioned? And
> how would I check on that?
>
> Nick
>
> ------------------------------
> View this message in context: How many partitions is my RDD split into?
> <http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
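For readers puzzling over why the workaround in the question gives the partition count: each partition's function yields its own index exactly once, so counting the yielded values counts the partitions. The sketch below mimics that logic in pure Python, with no Spark cluster required; `parallelize` and `map_partitions_with_index` here are hypothetical stand-ins for `sc.parallelize` and `RDD.mapPartitionsWithIndex`, not the real Spark APIs.

```python
# Pure-Python sketch of the mapPartitionsWithIndex trick from the question.
# Each "partition" function yields its index once; counting the results
# therefore counts the partitions.

def parallelize(data, num_partitions):
    """Hypothetical stand-in for sc.parallelize: split data into roughly
    equal chunks, one chunk per partition."""
    chunk = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def map_partitions_with_index(partitions, f):
    """Hypothetical stand-in for RDD.mapPartitionsWithIndex: apply
    f(index, iterator) to each partition and collect all yielded values."""
    return [x for idx, part in enumerate(partitions) for x in f(idx, iter(part))]

def f(idx, itr):
    # Ignore the partition's contents; yield its index once.
    yield idx

partitions = parallelize([1, 2, 3, 4], 4)
print(len(map_partitions_with_index(partitions, f)))  # → 4, the partition count
```

In real Spark, the reply's `rdd.partitions.size` (Scala) reads the same information directly from the RDD's metadata instead of running a job over the data.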