As Mark said, you can access this easily. The main issue I've seen from a performance perspective is jobs with a large number of very small partitions. This will still work, but performance will improve if you consolidate the partitions using rdd.coalesce().

This can happen, for example, if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset covering a year.
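A minimal PySpark sketch of that scenario (the data and numbers are made up for illustration; getNumPartitions() is available in later PySpark releases):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "coalesce-example")

    # Pretend this is a year of data, one partition per day.
    year = sc.parallelize(range(365 * 1000), 365)

    # A highly selective filter: keep roughly one day's worth.
    one_day = year.filter(lambda x: x < 1000)

    # The filtered RDD still has 365 partitions, most nearly empty.
    # coalesce() consolidates them without a full shuffle.
    compacted = one_day.coalesce(4)

    print(one_day.getNumPartitions())    # 365
    print(compacted.getNumPartitions())  # 4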
- Patrick

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> It's much simpler: rdd.partitions.size
>
> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>>
>> Hey there fellow Dukes of Data,
>>
>> How can I tell how many partitions my RDD is split into?
>>
>> I'm interested in knowing because, from what I gather, having a good
>> number of partitions is good for performance. If I'm looking to understand
>> how my pipeline is performing, say for a parallelized write out to HDFS,
>> knowing how many partitions an RDD has would be a good thing to check.
>>
>> Is that correct?
>>
>> I could not find an obvious method or property to see how my RDD is
>> partitioned. Instead, I devised the following thingy:
>>
>> def f(idx, itr): yield idx
>>
>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>> rdd.mapPartitionsWithIndex(f).count()
>>
>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>> answer I'm looking for. Derp. :)
>>
>> So in summary, should I care about how finely my RDDs are partitioned?
>> And how would I check on that?
>>
>> Nick
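For completeness, a runnable PySpark version of the check discussed in this thread. Nick's mapPartitionsWithIndex trick is reproduced as-is; rdd.getNumPartitions(), the Python counterpart of Mark's rdd.partitions.size, is available in later PySpark releases:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-count")

    rdd = sc.parallelize([1, 2, 3, 4], 4)

    # Nick's trick: emit one element per partition, then count them.
    def f(idx, itr):
        yield idx

    print(rdd.mapPartitionsWithIndex(f).count())  # 4

    # Direct method (Scala equivalent: rdd.partitions.size):
    print(rdd.getNumPartitions())                 # 4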