[ 
https://issues.apache.org/jira/browse/KAFKA-9552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037683#comment-17037683
 ] 

Matthias J. Sax edited comment on KAFKA-9552 at 11/23/20, 4:59 PM:
-------------------------------------------------------------------

I am not sure if we need to re-balance – if we would have missed a rebalance 
and lost the task, we would get a `ProducerFencedException`. Hence, on this 
error we should still be part of the consumer group.

>From my understanding an `OutOfOrderSequenceException` implies data loss, ie, 
>we got an ack back, but on the next send data is not in the log (this could 
>happen if unclean leader election is enabled broker side) – otherwise should 
>indicate a severe bug.

While we could abort the current transaction, and reinitialize the task (ie, 
refetch the input topic offsets, cleanup the state etc), I am wondering if we 
should do this as it would mask a bug? Instead, it might be better to not catch 
and fail fast thus we can report this error?

Btw: In `RecordCollectorImpl` in a recent PR we started to catch 
`OutOfOrderSequenceException` and rethrow `TaskMigratedException` for this case 
– however, I am not sure if we should keep this change or roll it back for the 
same reason.

\cc [~guozhang] [~hachikuji] [~vvcephei]


was (Author: mjsax):
I am not sure if we need to re-balance – if we would have missed a rebalance 
and lost the task, we would get a `ProducerFencedException`. Hence, on this 
error we should still be part of the consumer group.

>From my understanding an `OutOfOrderSequenceExceptio`n implies data loss, ie, 
>we got an ack back, but on the next send data is not in the log (this could 
>happen if unclean leader election is enabled broker side) – otherwise should 
>indicate a severe bug.

While we could abort the current transaction, and reinitialize the task (ie, 
refetch the input topic offsets, cleanup the state etc), I am wondering if we 
should do this as it would mask a bug? Instead, it might be better to not catch 
and fail fast thus we can report this error?

Btw: In `RecordCollectorImpl` in a recent PR we started to catch 
`OutOfOrderSequenceException` and rethrow `TaskMigratedException` for this case 
– however, I am not sure if we should keep this change or roll it back for the 
same reason.

\cc [~guozhang] [~hachikuji] [~vvcephei]

> Stream should handle OutOfSequence exception thrown from Producer
> -----------------------------------------------------------------
>
>                 Key: KAFKA-9552
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9552
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: Boyang Chen
>            Priority: Major
>
> As of today the stream thread could die from OutOfSequence error:
> {code:java}
>  [2020-02-12T07:14:35-08:00] 
> (streams-soak-2-5-eos_soak_i-03f89b1e566ac95cc_streamslog) 
> org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker 
> received an out of order sequence number.
>  [2020-02-12T07:14:35-08:00] 
> (streams-soak-2-5-eos_soak_i-03f89b1e566ac95cc_streamslog) [2020-02-12 
> 15:14:35,185] ERROR 
> [stream-soak-test-546f8754-5991-4d62-8565-dbe98d51638e-StreamThread-1] 
> stream-thread 
> [stream-soak-test-546f8754-5991-4d62-8565-dbe98d51638e-StreamThread-1] Failed 
> to commit stream task 3_2 due to the following error: 
> (org.apache.kafka.streams.processor.internals.AssignedStreamsTasks)
>  [2020-02-12T07:14:35-08:00] 
> (streams-soak-2-5-eos_soak_i-03f89b1e566ac95cc_streamslog) 
> org.apache.kafka.streams.errors.StreamsException: task [3_2] Abort sending 
> since an error caught with a previous record (timestamp 1581484094825) to 
> topic stream-soak-test-KSTREAM-AGGREGATE-STATE-STORE-0000000049-changelog due 
> to org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker 
> received an out of order sequence number.
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:154)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:52)
>  at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:214)
>  at 
> org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1353)
> {code}
>  Although this is fatal exception for Producer, stream should treat it as an 
> opportunity to reinitialize by doing a rebalance, instead of killing 
> computation resource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to