Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.
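
To check my understanding, here is a rough plain-Python model of what I think happens. Lists stand in for partitions; this is not Spark code, and the helper names filter_partitions and coalesce_partitions are made up for illustration. I'm also assuming a simple round-robin merge, which is not how the real rdd.coalesce() chooses its groupings.

```python
def filter_partitions(partitions, predicate):
    """Model of rdd.filter(): records are dropped, but every partition is kept."""
    return [[rec for rec in part if predicate(rec)] for part in partitions]


def coalesce_partitions(partitions, n):
    """Rough model of rdd.coalesce(n): merge partitions into at most n groups.

    Assumption: round-robin grouping, chosen only to keep the sketch short.
    """
    n = max(1, min(n, len(partitions)))
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups


# Hypothetical dataset: one partition per day, three records per day, 365 days.
year = [[(day, rec) for rec in range(3)] for day in range(365)]

# Keep a single day (day 100); the other days' partitions become empty.
one_day = filter_partitions(year, lambda rec: rec[0] == 100)
num_partitions = len(one_day)
num_empty = sum(1 for part in one_day if not part)

# Merging everything into one partition removes the empty ones.
compacted = coalesce_partitions(one_day, 1)
```

If this model is right, the filtered result still has as many partitions as the full year, and all but one of them are empty until they get coalesced.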

Nick

On Monday, March 24, 2014, Patrick Wendell <pwend...@gmail.com> wrote:

> As Mark said you can actually access this easily. The main issue I've
> seen from a performance perspective is people having a bunch of really
> small partitions. This will still work but the performance will
> improve if you coalesce the partitions using rdd.coalesce().
>
> This can happen for example if you do a highly selective filter on an
> RDD. For instance, you filter out one day of data from a dataset of a
> year.
>
> - Patrick
>
> On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > It's much simpler: rdd.partitions.size
> >
> >
> > On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
> > <nicholas.cham...@gmail.com> wrote:
> >>
> >> Hey there fellow Dukes of Data,
> >>
> >> How can I tell how many partitions my RDD is split into?
> >>
> >> I'm interested in knowing because, from what I gather, having a good
> >> number of partitions is good for performance. If I'm looking to understand
> >> how my pipeline is performing, say for a parallelized write out to HDFS,
> >> knowing how many partitions an RDD has would be a good thing to check.
> >>
> >> Is that correct?
> >>
> >> I could not find an obvious method or property to see how my RDD is
> >> partitioned. Instead, I devised the following thingy:
> >>
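> >> # Emit exactly one element (the partition index) per partition.
> >> def f(idx, itr): yield idx
> >>
> >> rdd = sc.parallelize([1, 2, 3, 4], 4)
> >> # count() then equals the number of partitions (4 here).
> >> rdd.mapPartitionsWithIndex(f).count()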
> >>
> >> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> >> answer I'm looking for. Derp. :)
> >>
> >> So in summary, should I care about how finely my RDDs are partitioned? And
> >> how would I check on that?
> >>
> >> Nick
> >>
> >>
> >> ________________________________
> >> View this message in context: How many partitions is my RDD split into?
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>
