Yes, using mapPartitionsWithIndex, e.g. in PySpark:

>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: ((idx, len(list(it))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, since len(list(it)) materializes the whole partition in memory, but you get the idea...)
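A minimal sketch of a lighter-weight variant: sum over the iterator instead of materializing it as a list, so only the running count is held in memory (same output here, assuming the same even 4-way split; on Python 3 use range instead of xrange):

>>> # sum(1 for _ in it) consumes the iterator lazily, one element at a time
>>> sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
...     lambda idx, it: [(idx, sum(1 for _ in it))]).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]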
Best,
-Sven

On Mon, Jan 12, 2015 at 6:54 PM, Kevin Burton <bur...@spinn3r.com> wrote:

> Is there a way to compute the total number of records in each RDD
> partition?
>
> So say I had 4 partitions... I'd want to have:
>
> partition 0: 100 records
> partition 1: 104 records
> partition 2: 90 records
> partition 3: 140 records
>
> Kevin
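For what it's worth, glom() is another way to get per-partition counts; note it materializes each partition as a list, so it carries the same memory caveat as len(list(it)). The counts come back in partition order, so the list index is the partition index:

>>> # glom() turns each partition into a list; map(len) counts its records
>>> sc.parallelize(xrange(0, 1000), 4).glom().map(len).collect()
[250, 250, 250, 250]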