Sweet! That's simple enough.

Here's a JIRA ticket to track adding this to PySpark for the future:

https://spark-project.atlassian.net/browse/SPARK-1308

Nick


On Mon, Mar 24, 2014 at 4:29 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Ah we should just add this directly in pyspark - it's as simple as the
> code Shivaram just wrote.
>
> - Patrick
>
> On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
> <shivaram.venkatara...@gmail.com> wrote:
> > There is no direct way to get this in pyspark, but you can get it from the
> > underlying java rdd. For example
> >
> > a = sc.parallelize([1,2,3,4], 2)
> > a._jrdd.splits().size()
> >
> >
> > On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
> > <nicholas.cham...@gmail.com> wrote:
> >>
> >> Mark,
> >>
> >> This appears to be a Scala-only feature. :(
> >>
> >> Patrick,
> >>
> >> Are we planning to add this to PySpark?
> >>
> >> Nick
> >>
> >>
> >> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com>
> >> wrote:
> >>>
> >>> It's much simpler: rdd.partitions.size
> >>>
> >>>
> >>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
> >>> <nicholas.cham...@gmail.com> wrote:
> >>>>
> >>>> Hey there fellow Dukes of Data,
> >>>>
> >>>> How can I tell how many partitions my RDD is split into?
> >>>>
> >>>> I'm interested in knowing because, from what I gather, having a good
> >>>> number of partitions is good for performance. If I'm looking to understand
> >>>> how my pipeline is performing, say for a parallelized write out to HDFS,
> >>>> knowing how many partitions an RDD has would be a good thing to check.
> >>>>
> >>>> Is that correct?
> >>>>
> >>>> I could not find an obvious method or property to see how my RDD is
> >>>> partitioned. Instead, I devised the following thingy:
> >>>>
> >>>> def f(idx, itr): yield idx
> >>>>
> >>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
> >>>> rdd.mapPartitionsWithIndex(f).count()
> >>>>
> >>>> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> >>>> answer I'm looking for. Derp. :)
> >>>>
> >>>> So in summary, should I care about how finely my RDDs are partitioned?
> >>>> And how would I check on that?
> >>>>
> >>>> Nick
> >>>>
> >>>>
> >>>> ________________________________
> >>>> View this message in context: How many partitions is my RDD split into?
> >>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>>
> >>>
> >>
> >
>
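For readers skimming the archive: the options discussed in this thread can be sketched as below. This is a hedged sketch, not code from the thread verbatim; `sc` is assumed to be a live SparkContext, and `getNumPartitions()` is the public PySpark method whose addition SPARK-1308 tracked.

```python
def f(idx, itr):
    # Nick's trick: yield exactly one value (the partition index) per
    # partition, so counting the results counts the partitions.
    yield idx

# With a live SparkContext named `sc` (assumed), any of these work:
#   rdd = sc.parallelize([1, 2, 3, 4], 4)
#   rdd.mapPartitionsWithIndex(f).count()   # Nick's counting trick
#   rdd._jrdd.splits().size()               # Shivaram's private-API workaround
#   rdd.getNumPartitions()                  # public method per SPARK-1308

# Why the counting trick works, simulated without Spark: f emits one
# element per partition regardless of how many items the partition holds.
partitions = [[1], [2], [3], [4]]  # pretend RDD split into 4 partitions
count = sum(1 for idx, part in enumerate(partitions)
            for _ in f(idx, iter(part)))
# count equals the number of partitions
```

Note that `_jrdd` is a private attribute of the Python RDD wrapper, so the first two approaches are best treated as workarounds for versions that predate the public method.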
