Currently, if you use accumulators inside actions (like foreach), you have a
guarantee that the values will be correct even if a partition is recalculated.
The same does NOT apply to transformations, where you cannot rely 100% on the
values.
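
As a rough illustration, here is a minimal sketch against the Spark 1.x Scala
API (local mode; the app name and data are just placeholders, not anything
from this thread):

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("acc-sketch").setMaster("local[*]"))
    val records = sc.parallelize(1 to 1000, 8)

    // Updated inside an action: Spark applies each task's update only once,
    // so the total stays correct even if tasks are retried.
    val actionCount = sc.accumulator(0L)
    records.foreach(_ => actionCount += 1L)

    // Updated inside a transformation: if a partition is recomputed (stage
    // retry, eviction of a cached partition, speculation), the update can be
    // applied again, so this is only a best-effort count.
    val transformCount = sc.accumulator(0L)
    val doubled = records.map { x => transformCount += 1L; x * 2 }
    doubled.count() // some action has to run the job

    println(s"exact: ${actionCount.value}, best-effort: ${transformCount.value}")
    sc.stop()
  }
}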

Pawel Szulc

On Fri, 27 Feb 2015 at 4:54 PM, Darin McBeath
<ddmcbe...@yahoo.com.invalid> wrote:

> I have a fairly large Spark job where I'm essentially creating quite a few
> RDDs, doing several types of joins using these RDDs, and producing a final
> RDD which I write back to S3.
>
>
> Along the way, I would like to capture record counts for some of these
> RDDs. My initial approach was to use the count action on some of these
> intermediate RDDs (and cache them, since the count would force the
> materialization of the RDD and the RDD would be needed again later). This
> seemed to work OK when my RDDs were fairly small/modest, but as they grew
> in size I started to experience problems.
>
> After watching a recent, very good screencast on performance, this doesn't
> seem like the correct approach, as I believe I'm really breaking (or
> hindering) the pipelining concept in Spark. If I remove all of my counts,
> I'm only left with the one job/action (save as Hadoop file at the end).
> Spark then seems to run smoother (and quite a bit faster), and I really
> don't need (or want) to even cache any of my intermediate RDDs.
>
> So, the approach I've been kicking around is to use accumulators instead.
> I was already using them to count 'bad' records, so why not 'good' records
> as well? I realize that if I lose a partition I might over-count, but
> perhaps that is an acceptable trade-off.
>
> I'm guessing that others have run into this before, so I would like to
> learn from the experience of others and how they have addressed this.
>
> Thanks.
>
> Darin.
>
