Yes, I saw that sentence too. But it's rather short and not very explanatory, and there doesn't seem to be any further info available anywhere that expands on it.
When I parse out that sentence:

1) "Kafka is not transactional" - i.e., the commits are done asynchronously, not synchronously.

2) "so your outputs must still be idempotent" - some of your commits may duplicate/overlap, so you need to be able to handle processing the same event(s) more than once.

That doesn't quite make sense to me, though. I don't quite understand why #1 implies #2. Yes, Kafka isn't transactional - i.e., it doesn't process my commits synchronously. But it should still process my commits *eventually*. If you look at my output from the previous message, even though I called commitAsync on 250959 -> 250962 in the first job, Kafka never actually processed those commits. That's not an eventual/asynchronous commit; that's an optional commit.

Is that in fact the semantics here - i.e., are calls to commitAsync not actually guaranteed to succeed? If so, the docs could really be a *lot* clearer about that.

Thanks,

DR

On Fri, Apr 28, 2017 at 11:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> From that doc:
>
> "However, Kafka is not transactional, so your outputs must still be
> idempotent."
>
> On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch <daro...@gmail.com> wrote:
> > I'm doing a POC to test recovery with Spark Streaming from Kafka. I'm
> > using the technique for storing the offsets in Kafka, as described at:
> >
> > https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself
> >
> > I.e., grabbing the list of offsets before I start processing a batch of
> > RDDs, and then committing them when I'm done. The process pretty much
> > works: when I shut down my streaming process and then start it up again,
> > it picks up roughly where it left off.
> >
> > However, it looks like there's some overlap happening, where a few of
> > the messages are being processed by both the old and the new streaming
> > job runs. I.e., see the following log messages:
> >
> > End of old job run:
> > 17/04/27 20:04:40 INFO KafkaRecoveryTester$: Committing rdd offsets:
> > OffsetRange(topic: 'Datalake', partition: 0, range: [250959 -> 250962]);
> > OffsetRange(topic: 'Datalake', partition: 1, range: [15 -> 18]);
> > OffsetRange(topic: 'Datalake', partition: 3, range: [15 -> 18]);
> > OffsetRange(topic: 'Datalake', partition: 2, range: [14 -> 17])
> >
> > Start of new job run:
> > 17/04/27 20:56:50 INFO KafkaRecoveryTester$: Processing rdd with offsets:
> > OffsetRange(topic: 'Datalake', partition: 3, range: [15 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 1, range: [15 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 2, range: [14 -> 100]);
> > OffsetRange(topic: 'Datalake', partition: 0, range: [250959 -> 251044])
> >
> > Notice that in partition 0, for example, the 3 messages with offsets
> > 250959 through 250961 are being processed twice - once by the old job,
> > and once by the new. I would have expected that in the new run, the
> > offset range for partition 0 would have been 250962 -> 251044, which
> > would result in exactly-once semantics.
> >
> > Am I misunderstanding how this should work? (I.e., is exactly-once
> > semantics not possible here?)
> >
> > Thanks,
> >
> > DR
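For concreteness, here is a minimal sketch of the offset-commit pattern the thread is discussing, using the spark-streaming-kafka-0-10 direct stream API. The broker address, group id, app name, and batch interval are made-up placeholders; only the topic name is taken from the logs above. The two-argument overload of commitAsync takes an OffsetCommitCallback, which at least makes it observable whether a given commit ever reached the broker:

import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaRecoveryTester"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",  // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "kafka-recovery-tester",    // placeholder group id
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("Datalake"), kafkaParams))

stream.foreachRDD { rdd =>
  // Grab the offset ranges before any transformation loses the KafkaRDD type.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here; the output must tolerate replays ...

  // Commit only after the output succeeds. The callback fires when (and if)
  // the broker responds, so a commit that never lands is at least visible.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
    new OffsetCommitCallback {
      override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                              exception: Exception): Unit =
        if (exception != null)
          println(s"Offset commit failed; range will be reprocessed on restart: $exception")
    })
}

ssc.start()
ssc.awaitTermination()

Since the commit is asynchronous by design, a job that shuts down before the broker acknowledges leaves the callback unfired and the commit undelivered, which would be consistent with the 250959 -> 250962 overlap shown in the logs above.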
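And a hedged sketch of what "idempotent output" can mean in practice: derive each write's key deterministically from the record's (topic, partition, offset) coordinates, so replaying a range overwrites the earlier writes instead of duplicating them. The upsert parameter here is a hypothetical stand-in for any key-value sink with overwrite semantics:

import org.apache.kafka.clients.consumer.ConsumerRecord

// `upsert` is a hypothetical sink operation: writing the same key twice
// must leave the sink in the same state as writing it once.
def writeIdempotently(records: Iterator[ConsumerRecord[String, String]],
                      upsert: (String, String) => Unit): Unit =
  records.foreach { r =>
    // Deterministic key from the record's coordinates: replaying offsets
    // 250959 -> 250962 rewrites the same rows rather than appending duplicates.
    val key = s"${r.topic}/${r.partition}/${r.offset}"
    upsert(key, r.value)
  }

With a sink like this, the duplicate delivery of offsets 250959 through 250961 is harmless: the second run simply rewrites the same three keys.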