Re: Question about Spark best practice when counting records.

2015-02-27 Thread Paweł Szulc
Currently, if you use accumulators inside actions (like foreach), you have a guarantee that the values will be correct even if a partition is recalculated. The same does NOT apply to transformations, where you cannot rely on the values 100%. Pawel Szulc
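
To make the distinction concrete, here is a minimal sketch in Scala against the Spark 1.x accumulator API; the app name, master setting, and data are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSemantics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("accumulator-demo").setMaster("local[2]"))
    val data = sc.parallelize(1 to 1000)

    // Reliable: updated inside an action. Spark applies each task's
    // update exactly once, even if a partition is recomputed.
    val actionCount = sc.accumulator(0L, "records-in-action")
    data.foreach(_ => actionCount += 1L)

    // Unreliable: updated inside a transformation. Any recomputation of
    // the partition (a failure, cache eviction, or simply a second action
    // over the same lineage) applies the updates again.
    val transformCount = sc.accumulator(0L, "records-in-transformation")
    val doubled = data.map { x => transformCount += 1L; x * 2 }
    doubled.count() // transformCount is now 1000
    doubled.count() // the map is recomputed: transformCount climbs toward 2000

    println(s"action count:    ${actionCount.value}")    // 1000
    println(s"transform count: ${transformCount.value}") // likely 2000
    sc.stop()
  }
}

The foreach-based count is safe for exactly the reason above; the map-based count drifts upward whenever the lineage is re-run, here simply by invoking a second action.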

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Kostas Sakellis
Hey Darin, Record count metrics are coming in Spark 1.3. Can you wait until it is released? Or do you need a solution in older versions of Spark? Kostas
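
For later readers: Kostas is presumably referring to the per-task records-read/records-written metrics that landed in Spark 1.3 and show up in the web UI. Below is a hedged sketch of pulling them programmatically with a SparkListener; it assumes the 1.3-era TaskMetrics API (inputMetrics as an Option, a recordsRead field), which is my recollection of that API rather than anything stated in this thread, so treat it as an outline, not a tested recipe:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums records read across finished tasks, assuming the Spark 1.3-era
// TaskMetrics API where inputMetrics is an Option[InputMetrics] and
// InputMetrics carries a recordsRead counter.
class RecordCountListener extends SparkListener {
  val recordsRead = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for {
      metrics <- Option(taskEnd.taskMetrics) // can be null for failed tasks
      input   <- metrics.inputMetrics
    } recordsRead.addAndGet(input.recordsRead)
  }
}

// Usage sketch:
//   val listener = new RecordCountListener
//   sc.addSparkListener(listener)
//   ... run the job ...
//   println(s"records read: ${listener.recordsRead.get}")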

Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
I have a fairly large Spark job where I'm essentially creating quite a few RDDs, doing several types of joins using these RDDs, and producing a final RDD which I write back to S3. Along the way, I would like to capture record counts for some of these RDDs. My initial approach was to use the count
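
The archived snippet cuts off mid-sentence, but assuming "the count" refers to the count action, a minimal sketch of that approach looks like the following; the record type, parser, and S3 paths are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed pre-1.3

object CountAlongTheWay {
  // Hypothetical record type and parser, purely for illustration.
  case class Rec(id: String, payload: String)
  def parse(line: String): Rec = {
    val Array(id, rest) = line.split("\t", 2) // assumes well-formed input
    Rec(id, rest)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("count-demo"))

    val left  = sc.textFile("s3n://bucket/left/*").map(parse)  // paths hypothetical
    val right = sc.textFile("s3n://bucket/right/*").map(parse)

    // Each count() launches its own job, so cache first: otherwise the
    // lineage (S3 read + parse) is recomputed once for the count and
    // again for the join below.
    left.cache()
    val leftCount = left.count()

    val joined = left.keyBy(_.id).join(right.keyBy(_.id)).cache()
    val joinedCount = joined.count()

    joined.map { case (id, (l, r)) => s"$id\t${l.payload}\t${r.payload}" }
      .saveAsTextFile("s3n://bucket/output")

    println(s"left=$leftCount joined=$joinedCount")
    sc.stop()
  }
}

The trade-off: each count() is a separate job, so caching spends memory to avoid re-reading S3 and re-parsing. Pawel's caveat earlier in the thread explains why incrementing an accumulator inside the map transformation instead would not be reliable across recomputations.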