There is no direct way to get this in PySpark, but you can get it from the
underlying Java RDD. For example:

a = sc.parallelize([1, 2, 3, 4], 2)  # explicitly ask for 2 partitions
a._jrdd.splits().size()              # -> 2
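
If you would rather stay on the public API, Nick's mapPartitionsWithIndex trick further down the thread gives the same answer. Here is a rough sketch that wraps both approaches in one helper (num_partitions is just a name I made up here, and _jrdd.splits() is internal, so it may change between releases):

# Sketch only: combines the two approaches discussed in this thread.
# num_partitions is a hypothetical helper; _jrdd.splits() is internal API.
def num_partitions(rdd, use_internal=True):
    if use_internal:
        # hop through the underlying Java RDD and read its split count
        return rdd._jrdd.splits().size()
    # public-API fallback: emit one record per partition, then count them
    return rdd.mapPartitionsWithIndex(lambda idx, it: [idx]).count()

rdd = sc.parallelize([1, 2, 3, 4], 2)
print(num_partitions(rdd))         # 2
print(num_partitions(rdd, False))  # 2

Either way you get the partition count without repartitioning anything; the internal route is just cheaper, since it reads metadata instead of running a job.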


On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Mark,
>
> This appears to be a Scala-only feature. :(
>
> Patrick,
>
> Are we planning to add this to PySpark?
>
> Nick
>
>
> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> It's much simpler: rdd.partitions.size
>>
>>
>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hey there fellow Dukes of Data,
>>>
>>> How can I tell how many partitions my RDD is split into?
>>>
>>> I'm interested in knowing because, from what I gather, having a good
>>> number of partitions is good for performance. If I'm looking to understand
>>> how my pipeline is performing, say for a parallelized write out to HDFS,
>>> knowing how many partitions an RDD has would be a good thing to check.
>>>
>>> Is that correct?
>>>
>>> I could not find an obvious method or property to see how my RDD is
>>> partitioned. Instead, I devised the following thingy:
>>>
>>> def f(idx, itr): yield idx
>>>
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> rdd.mapPartitionsWithIndex(f).count()
>>>
>>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>>> answer I'm looking for. Derp. :)
>>>
>>> So in summary, should I care about how finely my RDDs are partitioned?
>>> And how would I check on that?
>>>
>>> Nick
>>>
>>>
>>> ------------------------------
>>> View this message in context: How many partitions is my RDD split 
>>> into?<http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html>
>>> Sent from the Apache Spark User List mailing list 
>>> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at Nabble.com.
>>>
>>
>>
>
