Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Michael Segel
Sorry, but the accumulator is still going to require you to walk through the RDD to get an accurate count, right? It's not being persisted? On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Alternative to doing a naive toArray is to declare an accumulator per

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer t...@preferred.jp wrote: If you think of items.map(x => /* throw exception */).count() then even though the count you want to get does not necessarily require the evaluation of the function in map() (i.e., the number is the

RE: quickly counting the number of rows in a partition?

2015-01-13 Thread Ganelin, Ilya
Alternative to doing a naive toArray is to declare an accumulator per partition and use that. It's specifically what they were designed to do. See the programming guide. -----Original Message----- From: Tobias Pfeiffer

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Use the mapPartitions function. It returns an iterator to each partition. Then just get that length by converting to an array. On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:

quickly counting the number of rows in a partition?

2015-01-12 Thread Kevin Burton
Is there a way to compute the total number of records in each RDD partition? So say I had 4 partitions. I’d want to have:
partition 0: 100 records
partition 1: 104 records
partition 2: 90 records
partition 3: 140 records
Kevin -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog:

Re: quickly counting the number of rows in a partition?

2015-01-12 Thread Sven Krasser
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

    sc.parallelize(xrange(0,1000), 4).mapPartitionsWithIndex(lambda idx,iter: ((idx, len(list(iter))),)).collect()
    [(0, 250), (1, 250), (2, 250), (3, 250)]

(This is not the most efficient way to get the length of an iterator, but you get the
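
Regarding the efficiency caveat in the reply above: one possible way to avoid building a full list per partition is to count the iterator as it streams by. The sketch below is only illustrative. The name count_partition is hypothetical, and the "partitions" are plain Python iterators standing in for Spark partitions so that the snippet runs without a cluster.

```python
def count_partition(idx, records):
    """Yield a single (partition_index, record_count) pair."""
    # sum() consumes the iterator one element at a time, so memory use
    # stays constant instead of growing as it would with list(records).
    yield idx, sum(1 for _ in records)


# Stand-in for four Spark partitions of 250 records each (assumption:
# plain iterators are used here only so the example runs without Spark).
partitions = [iter(range(i * 250, (i + 1) * 250)) for i in range(4)]

counts = [pair
          for idx, part in enumerate(partitions)
          for pair in count_partition(idx, part)]
print(counts)  # [(0, 250), (1, 250), (2, 250), (3, 250)]
```

With a live SparkContext named sc, the same function could presumably be passed directly, e.g. sc.parallelize(range(1000), 4).mapPartitionsWithIndex(count_partition).collect(). Note that this still walks every record in each partition, as pointed out elsewhere in the thread; it only avoids holding a whole partition in memory.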