There is no direct way to get this in PySpark, but you can get it from the underlying Java RDD. For example:
a = sc.parallelize([1, 2, 3, 4], 2)
a._jrdd.splits().size()

On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Mark,
>
> This appears to be a Scala-only feature. :(
>
> Patrick,
>
> Are we planning to add this to PySpark?
>
> Nick
>
> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> It's much simpler: rdd.partitions.size
>>
>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> Hey there fellow Dukes of Data,
>>>
>>> How can I tell how many partitions my RDD is split into?
>>>
>>> I'm interested in knowing because, from what I gather, having a good
>>> number of partitions is good for performance. If I'm looking to understand
>>> how my pipeline is performing, say for a parallelized write out to HDFS,
>>> knowing how many partitions an RDD has would be a good thing to check.
>>>
>>> Is that correct?
>>>
>>> I could not find an obvious method or property to see how my RDD is
>>> partitioned. Instead, I devised the following thingy:
>>>
>>> def f(idx, itr): yield idx
>>>
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> rdd.mapPartitionsWithIndex(f).count()
>>>
>>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>>> answer I'm looking for. Derp. :)
>>>
>>> So in summary, should I care about how finely my RDDs are partitioned?
>>> And how would I check on that?
>>>
>>> Nick
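
For reference, here is a minimal, self-contained sketch of both approaches discussed above. The master setting and app name are illustrative, and _jrdd is a private attribute, so its behavior may differ between Spark releases:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-count-example")

    rdd = sc.parallelize([1, 2, 3, 4], 4)

    # Approach 1: peek at the underlying Java RDD (private API, version-dependent).
    print(rdd._jrdd.splits().size())  # -> 4

    # Approach 2: tag each partition with its index, then count the tags
    # (each partition yields exactly one element, so count() equals the
    # number of partitions).
    def f(idx, itr):
        yield idx

    print(rdd.mapPartitionsWithIndex(f).count())  # -> 4

    sc.stop()

Later PySpark releases also expose rdd.getNumPartitions(), which is the cleanest option where available.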