Mark, This appears to be a Scala-only feature. :(
Patrick, Are we planning to add this to PySpark?

Nick


On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> It's much simpler: rdd.partitions.size
>
>
> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Hey there, fellow Dukes of Data,
>>
>> How can I tell how many partitions my RDD is split into?
>>
>> I'm interested in knowing because, from what I gather, having a good
>> number of partitions is good for performance. If I'm looking to understand
>> how my pipeline is performing, say for a parallelized write out to HDFS,
>> knowing how many partitions an RDD has would be a good thing to check.
>>
>> Is that correct?
>>
>> I could not find an obvious method or property to see how my RDD is
>> partitioned. Instead, I devised the following thingy:
>>
>>     def f(idx, itr): yield idx
>>
>>     rdd = sc.parallelize([1, 2, 3, 4], 4)
>>     rdd.mapPartitionsWithIndex(f).count()
>>
>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
>> answer I'm looking for. Derp. :)
>>
>> So in summary, should I care about how finely my RDDs are partitioned?
>> And how would I check on that?
>>
>> Nick
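
For reference, here is a runnable sketch of the workaround Nick describes, assuming a live SparkContext bound to `sc` (as in the PySpark shell). Each partition yields its own index exactly once, so counting the results gives the number of partitions; later PySpark releases also added a direct rdd.getNumPartitions(), the counterpart of Scala's rdd.partitions.size that Mark mentions.

    # Assumes `sc` is an existing SparkContext (e.g. from the PySpark shell).
    def f(idx, itr):
        # Ignore the partition's contents; yield its index once per partition.
        yield idx

    rdd = sc.parallelize([1, 2, 3, 4], 4)

    # One yielded index per partition, so the count is the partition count.
    print(rdd.mapPartitionsWithIndex(f).count())  # -> 4

    # In later PySpark releases, this direct method exists:
    # print(rdd.getNumPartitions())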