Yes, I saw that sentence too.  But it's rather short and not very
explanatory, and there doesn't seem to be any further info available
anywhere that expands on it.

When I parse out that sentence:

1) "Kafka is not transactional" - i.e., the commits are done
asynchronously, not synchronously.
2) "so your outputs must still be idempotent" - some of your commits may
duplicate/overlap, so you need to be able to handle processing the same
event(s) more than once.
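
To make sure I understand "idempotent" here, I take it to mean something
like the toy sketch below: writes keyed by topic/partition/offset, so
replaying the same records just overwrites rather than duplicates.  (Names
here are made up for illustration, not from my actual code.)

  import scala.collection.concurrent.TrieMap

  // Toy stand-in for an idempotent sink: each record is written under its
  // (topic, partition, offset) key, so reprocessing the same offsets after a
  // duplicated/overlapping batch simply overwrites the earlier write.
  object IdempotentSink {
    private val store = TrieMap.empty[(String, Int, Long), String]

    def write(topic: String, partition: Int, offset: Long, value: String): Unit =
      store.put((topic, partition, offset), value)
  }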

That doesn't quite make sense to me, though.  I don't see why #1 implies
#2.  Yes, Kafka isn't transactional - i.e., it doesn't process my commits
synchronously.  But it should still be processing my commits *eventually*.
If you look at my output from the previous message, even though I called
commitAsync on partition 0's range 250959 -> 250962 in the first job,
Kafka never actually processed that commit.  That's not an
eventual/asynchronous commit; that's an optional commit.
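
For reference, the pattern I'm using is basically the one from the
integration doc linked in my original message below - trimmed down here.
The topic name is from my logs; the config values, batch interval, and the
processing step are just stand-ins for my actual test setup:

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

  object KafkaRecoveryTester {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("KafkaRecoveryTester"), Seconds(10))

      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "kafka-recovery-tester",
        "enable.auto.commit" -> (false: java.lang.Boolean))

      val stream = KafkaUtils.createDirectStream[String, String](
        ssc, PreferConsistent,
        Subscribe[String, String](Seq("Datalake"), kafkaParams))

      stream.foreachRDD { rdd =>
        // Grab the offset ranges for this batch before doing any processing.
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        // ... process the batch (stand-in for the real work) ...
        rdd.foreach(record => println(record.value))

        // Commit the offsets back to Kafka once the batch is done.  This only
        // *queues* the commit; it is not sent synchronously.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }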

Is that in fact the semantics here - i.e., calls to commitAsync are not
actually guaranteed to succeed?  If that's the case, the docs could really
be a *lot* clearer about that.
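
In the meantime, I'll try passing a commit callback so I can at least see
whether the commits ever actually land - something like this (just a
sketch, extending the commit call above; I haven't wired it into my test
code yet):

  import java.util.{Map => JMap}

  import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
  import org.apache.kafka.common.TopicPartition
  import org.apache.spark.streaming.kafka010.CanCommitOffsets

  // Same commit call as before, but with a callback that logs whether each
  // batch of offsets actually made it to Kafka or failed with an exception.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
    new OffsetCommitCallback {
      override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata],
                              exception: Exception): Unit = {
        if (exception != null) println(s"Offset commit failed: $exception")
        else println(s"Offsets committed: $offsets")
      }
    })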

Thanks,

DR

On Fri, Apr 28, 2017 at 11:34 AM, Cody Koeninger <c...@koeninger.org> wrote:

> From that doc:
>
> " However, Kafka is not transactional, so your outputs must still be
> idempotent. "
>
>
>
> On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch <daro...@gmail.com>
> wrote:
> > I'm doing a POC to test recovery with spark streaming from Kafka.  I'm
> > using the technique for storing the offsets in Kafka, as described at:
> >
> > https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself
> >
> > I.e., grabbing the list of offsets before I start processing a batch of
> > RDD's, and then committing them when I'm done.  The process pretty much
> > works:  when I shut down my streaming process and then start it up
> > again, it pretty much picks up where it left off.
> >
> > However, it looks like there's some overlap happening, where a few of the
> > messages are being processed by both the old and the new streaming job
> > runs.  I.e., see the following log messages:
> >
> > End of old job run:
> > 17/04/27 20:04:40 INFO KafkaRecoveryTester$: Committing rdd offsets:
> > OffsetRange(topic: 'Datalake', partition: 0, range: [250959 -> 250962]);
> > OffsetRange(topic: 'Datalake', partition: 1, range: [15 -> 18]);
> > OffsetRange(topic: 'Datalake', partition: 3, range: [15 -> 18]);
> > OffsetRange(topic: 'Datalake', partition: 2, range: [14 -> 17])
> >
> > Start of new job run:
> > 17/04/27 20:56:50 INFO KafkaRecoveryTester$: Processing rdd with offsets:
> > OffsetRange(topic: 'Datalake', partition: 3, range: [15 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 1, range: [15 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 2, range: [14 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 0, range: [250959 -> 251044])
> >
> >
> > Notice that in partition 0, for example, the 3 messages with offsets
> > 250959 through 250961 are being processed twice - once by the old job,
> > and once by the new.  I would have expected that in the new run, the
> > offset range for partition 0 would have been 250962 -> 251044, which
> > would result in exactly-once semantics.
> >
> > Am I misunderstanding how this should work?  (I.e., exactly-once
> > semantics is not possible here?)
> >
> > Thanks,
> >
> > DR
>
