Currently, if you use accumulators inside actions (like foreach), you have a
guarantee that the values will be correct even if a partition is recomputed.
The same does NOT apply to transformations, where you cannot rely on the
values 100%.
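For example (a minimal sketch of the difference, assuming this runs in the
spark-shell, where sc is already in scope; the dataset and output path are
illustrative):

    // Reliable: the accumulator is updated inside an action (foreach).
    // Spark applies each task's update exactly once, even if a partition
    // is recomputed after a failure.
    val data = sc.parallelize(1 to 1000)
    val actionCount = sc.accumulator(0L)
    data.foreach(_ => actionCount += 1L)
    println(actionCount.value) // always 1000

    // NOT reliable: the accumulator is updated inside a transformation (map).
    // If a partition is recomputed (task failure, cache eviction, lineage
    // reuse), the updates can be applied more than once.
    val transformCount = sc.accumulator(0L)
    val mapped = data.map { x => transformCount += 1L; x }
    mapped.saveAsTextFile("/tmp/mapped-output") // illustrative path
    println(transformCount.value) // usually 1000, but not guaranteed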
Pawel Szulc
Fri., 27 Feb 2015, 4:54 PM Darin McBeath
Hey Darin,
Record count metrics are coming in Spark 1.3. Can you wait until it is
released? Or do you need a solution in older versions of Spark?
Kostas
On Friday, February 27, 2015, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
Sent: Friday, February 27, 2015 12:19 PM
Subject: Re: Question about Spark best practice when counting records.
I have a fairly large Spark job where I'm essentially creating quite a few
RDDs, doing several types of joins with these RDDs, and producing a final RDD
which I write back to S3.
Along the way, I would like to capture record counts for some of these RDDs.
My initial approach was to use the count() action.
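A rough sketch of that approach (a minimal example, not my actual job; the
RDD names and S3 paths below are made up, and I'm assuming the spark-shell,
where sc and the pair-RDD implicits are already in scope):

    // count() is an action, so each call runs a separate job over the
    // RDD's lineage. Caching the RDD before counting avoids recomputing
    // it again for the final write.
    val left  = sc.textFile("s3n://my-bucket/left/")
    val right = sc.textFile("s3n://my-bucket/right/")

    val joined = left.map(l => (l.split("\t")(0), l))
      .join(right.map(r => (r.split("\t")(0), r)))
      .cache() // reuse the join result for both the count and the write

    val joinedCount = joined.count() // extra job, but reads the cached data
    joined.saveAsTextFile("s3n://my-bucket/output/")
    println(s"joined records: $joinedCount")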