Oh, glad to know it's that simple! Patrick, in your last comment did you mean filter in? As in I start with one year of data and filter it so I have one day left? I'm assuming in that case the empty partitions would be for all the days that got filtered out.
Nick

On Monday, March 24, 2014, Patrick Wendell <pwend...@gmail.com> wrote:

> As Mark said you can actually access this easily. The main issue I've
> seen from a performance perspective is people having a bunch of really
> small partitions. This will still work but the performance will
> improve if you coalesce the partitions using rdd.coalesce().
>
> This can happen for example if you do a highly selective filter on an
> RDD. For instance, you filter out one day of data from a dataset of a
> year.
>
> - Patrick
>
> On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> > It's much simpler: rdd.partitions.size
> >
> > On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> >>
> >> Hey there fellow Dukes of Data,
> >>
> >> How can I tell how many partitions my RDD is split into?
> >>
> >> I'm interested in knowing because, from what I gather, having a good
> >> number of partitions is good for performance. If I'm looking to understand
> >> how my pipeline is performing, say for a parallelized write out to HDFS,
> >> knowing how many partitions an RDD has would be a good thing to check.
> >>
> >> Is that correct?
> >>
> >> I could not find an obvious method or property to see how my RDD is
> >> partitioned. Instead, I devised the following thingy:
> >>
> >> def f(idx, itr): yield idx
> >>
> >> rdd = sc.parallelize([1, 2, 3, 4], 4)
> >> rdd.mapPartitionsWithIndex(f).count()
> >>
> >> Frankly, I'm not sure what I'm doing here, but this seems to give me the
> >> answer I'm looking for. Derp. :)
> >>
> >> So in summary, should I care about how finely my RDDs are partitioned? And
> >> how would I check on that?
> >>
> >> Nick
> >>
> >> ________________________________
> >> View this message in context: How many partitions is my RDD split into?
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
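
To make the empty-partition scenario discussed in this thread concrete, here is a minimal sketch in plain Python rather than Spark itself. Each inner list stands in for one partition; the helpers `filter_partitions` and `coalesce` and the one-partition-per-day layout are assumptions made up for illustration and only mimic the behaviour described above. In PySpark, the corresponding calls would be `rdd.getNumPartitions()` to count partitions (the `rdd.partitions.size` mentioned above is the Scala form) and `rdd.coalesce(n)` to merge them.

```python
# Toy model, not Spark: each inner list plays the role of one RDD partition.

def filter_partitions(partitions, predicate):
    """Filter records inside each partition; the number of partitions is unchanged."""
    return [[rec for rec in part if predicate(rec)] for part in partitions]


def coalesce(partitions, num_partitions):
    """Merge neighbouring partitions into at most num_partitions, keeping record order."""
    num_partitions = max(1, min(num_partitions, len(partitions)))
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Map source partition i onto a target partition in contiguous blocks.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


# Hypothetical dataset: one partition per day, 10 records per day, as (day, record_id).
year = [[(day, r) for r in range(10)] for day in range(365)]

# Highly selective filter: keep a single day out of the year.
one_day = filter_partitions(year, lambda rec: rec[0] == 42)
empty = sum(1 for part in one_day if not part)
print(len(one_day), empty)  # 365 364

# Coalescing collapses the mostly empty partitions into one.
compact = coalesce(one_day, 1)
print(len(compact), len(compact[0]))  # 1 10
```

Under these assumptions, the filter keeps all 365 partitions but leaves 364 of them empty, which is the case where coalescing pays off.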