I'll join Mark and Aljoscha to clarify Spark's guarantees (AFAIK).
In Spark, Accumulators (Spark's counters/Aggregators) count exactly once
if they are updated inside an action (e.g., foreach). Because of Spark's
lazy execution, accumulators that are updated inside lazy transformations
(e.g., map) might be counted again if the lineage is recomputed.
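
To make the difference concrete, here is a rough plain-Java simulation of
that behaviour. It does not use any Spark classes: lazyMap, the class name
and both count methods are hypothetical names I made up for illustration,
and a Supplier stands in for an uncached lineage that is re-run on every
evaluation. Treat it as a sketch of the idea, not of Spark's internals.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Plain-Java simulation only; every name here is hypothetical, not Spark API.
final class AccumulatorRecountSketch {

    // Stand-in for a lazy "map": nothing runs until get() is called, and each
    // get() re-runs the whole lineage, like an uncached, recomputed dataset.
    static Supplier<List<Integer>> lazyMap(Supplier<List<Integer>> parent,
                                           UnaryOperator<Integer> fn) {
        return () -> parent.get().stream().map(fn).collect(Collectors.toList());
    }

    // Counter is updated inside the lazy transformation, so every evaluation
    // of the lineage increments it again.
    static long countInTransformation(List<Integer> input, int evaluations) {
        AtomicLong counter = new AtomicLong();
        Supplier<List<Integer>> source = () -> input;
        Supplier<List<Integer>> mapped = lazyMap(source, x -> {
            counter.incrementAndGet();
            return x * 2;
        });
        for (int i = 0; i < evaluations; i++) {
            mapped.get();
        }
        return counter.get();
    }

    // Counter is updated only in the final "action" step. Earlier evaluations
    // of the lineage (simulated recomputations) do not touch it.
    static long countInAction(List<Integer> input, int evaluations) {
        AtomicLong counter = new AtomicLong();
        Supplier<List<Integer>> source = () -> input;
        Supplier<List<Integer>> mapped = lazyMap(source, x -> x * 2);
        for (int i = 0; i < evaluations - 1; i++) {
            mapped.get(); // lineage recomputed, counter untouched
        }
        mapped.get().forEach(x -> counter.incrementAndGet()); // the action
        return counter.get();
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        System.out.println(countInTransformation(input, 2)); // prints 6
        System.out.println(countInAction(input, 2));         // prints 3
    }
}
```

With three elements and the lineage evaluated twice, the counter inside the
transformation reports 6, while the counter in the action reports 3.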

On Wed, Mar 23, 2016 at 2:13 PM William McCarthy <[email protected]>
wrote:

> Thanks Aljoscha and Max,
>
> The fact that this only happens in failure scenarios is good to know,
> thanks.
>
> Perhaps I'm unclear on what an “Aggregator” is. I assumed that a line such
> as the following:
>
> PCollection<KV<String, Double>> meanByName =
> dataPoints.apply(Mean.<String, Double>perKey());
>
> …would be considered an Aggregator, since it applies a mean aggregation
> over a window. Is that correct, with respect to the Beam terminology? If
> not, what would an example of an Aggregator be?
>
> Thanks,
>
> Bill
>
> On Mar 22, 2016, at 2:18 PM, Mark Shields <[email protected]> wrote:
>
> Google's streaming implementation has the same property: counters are not
> committed with work, so updates may sometimes be lost (i.e., undercounted)
> or replayed (i.e., overcounted). It's a tradeoff between having
> low-latency, cheap monitoring and coherence with the underlying
> processing.
>
> On Tue, Mar 22, 2016 at 1:57 AM, Aljoscha Krettek <[email protected]>
> wrote:
>
>> Hi,
>> in Flink the accumulators/aggregators are not fault-tolerant. In case of a
>> failure the job will be restarted but the accumulators will start from
>> scratch. Initially they were only meant as a rough way to gauge the
>> progress that a job is making. People should not rely on them for accurate
>> numbers right now.
>>
>> Cheers,
>> Aljoscha
>> > On 21 Mar 2016, at 20:37, William McCarthy <[email protected]>
>> wrote:
>> >
>> > Hi,
>> >
>> > I just had a look at the capability matrix here:
>> http://beam.incubator.apache.org/capability-matrix/ . I really like it,
>> as it gives a nice summary of the current state of implementation
>> completeness for the different runners.
>> >
>> > I had one follow-up question, regarding the cell at the intersection of
>> the Aggregators row and the Apache Flink column, with this content: "In
>> streaming mode, Aggregators may undercount". Can you give me some ideas
>> about what this means? In what circumstances might this happen? Are there
>> some mitigation strategies that are appropriate?
>> >
>> > Thanks,
>> >
>> > Bill
>>
>>
>
>
