Sorry, but the accumulator is still going to require you to walk through the
RDD to get an accurate count, right?
It's not being persisted?
On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote:
An alternative to doing a naive toArray is to declare an accumulator per partition and use that.
Hi again,
On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer t...@preferred.jp wrote:
If you think of

    items.map(x => /* throw exception */).count()

then even though the count you want to get does not necessarily require
the evaluation of the function in map() (i.e., the number is the same
whether or not that function runs), Spark will still evaluate it when
you call count(), so the exception will be thrown.
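A quick illustration of that point (a sketch, assuming a live SparkContext named sc; the values are made up):

    val items = sc.parallelize(1 to 10)
    // count() does not need the mapped values, but Spark evaluates the
    // map function anyway, so this job fails with the exception.
    items.map { x =>
      if (x == 5) throw new RuntimeException("boom")
      x
    }.count()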
An alternative to doing a naive toArray is to declare an accumulator per partition
and use that. It's specifically what they were designed to do. See the
programming guide.
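A rough Scala sketch of that approach, assuming the Spark 1.x accumulator API (sc.accumulator) that was current at the time, and a hypothetical rdd value:

    // One accumulator per partition, indexed by partition id.
    val counters = Array.fill(rdd.partitions.length)(sc.accumulator(0))

    // Walk each partition once; the empty iterator means nothing is
    // collected back, and count() only forces the evaluation.
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.foreach(_ => counters(idx) += 1)
      Iterator.empty
    }.count()

    // Accumulator values are only reliable on the driver after the action.
    counters.zipWithIndex.foreach { case (acc, idx) =>
      println(s"partition $idx: ${acc.value} elements")
    }

Note that this still triggers a full pass over the data, which is the concern raised in the reply at the top of this thread.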
-----Original Message-----
From: Tobias Pfeiffer
Hi,
On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Use the mapPartitions function. It gives you an iterator over each partition.
Then just get the length by converting it to an array.
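In Scala that might look like this (a sketch with a hypothetical rdd; note that toArray materializes each partition in memory just to take its length):

    val sizes = rdd.mapPartitions(iter => Iterator(iter.toArray.length)).collect()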
On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:
Yes, using mapPartitionsWithIndex, e.g. in PySpark:

    sc.parallelize(xrange(0, 1000), 4).mapPartitionsWithIndex(
        lambda idx, iter: ((idx, len(list(iter))),)).collect()
    [(0, 250), (1, 250), (2, 250), (3, 250)]
(This is not the most efficient way to get the length of an iterator, but
you get the idea.)
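For the record, a Scala variant that avoids materializing each partition (again assuming a hypothetical rdd):

    rdd.mapPartitionsWithIndex { (idx, iter) =>
      // iter.size consumes the iterator without building a collection.
      Iterator((idx, iter.size))
    }.collect()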