Matthias,

Good to hear that Kafka Streams handles this part internally: "If a rebalance/shutdown is triggered, Kafka Streams will stop processing new records and just finish processing all in-flight records. Afterwards, a commit happens right away for all fully processed records."
Since a regular consumer-producer doesn't support this, I guess there could be duplicates even in case of a normal shutdown. Is that correct, or does Kafka have support for a regular consumer-producer to handle in-flight processing and commit those offsets before a rebalance/shutdown occurs?
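For reference, a plain consumer can approximate this hand-off itself with a ConsumerRebalanceListener that commits the offsets of fully processed records in onPartitionsRevoked(). Below is a minimal sketch, not a definitive implementation; the bootstrap server, group id, topic name, and process() helper are placeholders:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RebalanceAwareConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Offsets of records that are fully processed but not yet committed.
        Map<TopicPartition, OffsetAndMetadata> processed = new HashMap<>();

        consumer.subscribe(Collections.singletonList("input-topic"), // placeholder
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Called on the polling thread before partitions are handed off:
                        // commit everything fully processed so the new owner starts clean.
                        consumer.commitSync(processed);
                        processed.clear();
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // Nothing to do on assignment in this sketch.
                    }
                });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // hypothetical per-record processing step
                processed.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
            }
            consumer.commitSync(processed); // commit once the batch is fully processed
            processed.clear();
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the application's processing logic
    }
}

This covers clean rebalances and shutdowns; after a hard crash, anything processed but not yet committed is redelivered, so at-least-once can still produce duplicates on failure, as Matthias notes below.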
On Wed, Oct 6, 2021 at 12:22 AM Matthias J. Sax <mj...@apache.org> wrote:

> > - By producer config, I hope you mean batching and other settings that
> > will hold off producing of events. Correct me if I'm wrong.
>
> Correct.
>
> > - Not sure what you mean by throughput here; which configuration would
> > dictate that?
>
> I referred to input topic throughput. With higher/lower throughput you
> might get data sooner/later, depending on your producer configs.
>
> > - Do you mean here that Kafka Streams internally handles waiting on
> > processing and offset commits of events that are already consumed and
> > being processed by the streams instance?
>
> If a rebalance/shutdown is triggered, Kafka Streams will stop processing
> new records and just finish processing all in-flight records.
> Afterwards, a commit happens right away for all fully processed records.
>
>
> -Matthias
>
>
> On 10/5/21 8:35 AM, Pushkar Deole wrote:
> > Matthias,
> >
> > On your response "For at-least-once, you would still get output
> > continuously, depending on throughput and producer configs":
> > - Not sure what you mean by throughput here; which configuration would
> > dictate that?
> > - By producer config, I hope you mean batching and other settings that
> > will hold off producing of events. Correct me if I'm wrong.
> >
> > On your response "For regular rebalances/restarts, a longer commit
> > interval has no impact because offsets would be committed right away":
> > - Do you mean here that Kafka Streams internally handles waiting on
> > processing and offset commits of events that are already consumed and
> > being processed by the streams instance?
> >
> > On Tue, Oct 5, 2021 at 11:43 AM Matthias J. Sax <mj...@apache.org> wrote:
> >
> >> The main motivation for a shorter commit interval for EOS is
> >> end-to-end latency. A Topology could consist of multiple
> >> sub-topologies, and the end-to-end latency for the EOS case is roughly
> >> commit-interval times number-of-subtopologies.
> >>
> >> For regular rebalances/restarts, a longer commit interval has no
> >> impact, because for a regular rebalance/restart, offsets would be
> >> committed right away to guarantee a clean hand-off. Only in case of
> >> failure can a longer commit interval lead to a larger number of
> >> duplicates (of course, only for at-least-once guarantees).
> >>
> >> For at-least-once, you would still get output continuously, depending
> >> on throughput and producer configs. Only offsets are committed every
> >> 30 seconds by default. This continuous output is also the reason why
> >> there is no latency impact for at-least-once with a longer commit
> >> interval.
> >>
> >> Besides the impact on latency, there is also a throughput impact:
> >> a longer commit interval provides higher throughput.
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 10/4/21 7:31 AM, Pushkar Deole wrote:
> >>> Hi All,
> >>>
> >>> I am looking into commit.interval.ms in Kafka Streams, which is
> >>> documented as the time interval at which Streams commits offsets to
> >>> source topics.
> >>> However, for the exactly-once guarantee its default is 100 ms,
> >>> whereas for at-least-once it is 30000 ms (i.e. 30 sec).
> >>> Why is there such a huge time difference between the 2 guarantees,
> >>> and what does it mean to have this interval as high as 30 seconds?
> >>> Does it also cause a higher probability of duplicates in case of
> >>> application restarts or partition rebalances?
> >>> Does it mean that the streams application would also publish events
> >>> to the destination topic only at this interval, which would mean a
> >>> delay in publishing events to the destination topic?
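For completeness, both knobs discussed above are ordinary Kafka Streams configs. By the rule of thumb in the thread, the EOS default of 100 ms with, say, three sub-topologies gives an end-to-end latency of roughly 100 ms x 3 = 300 ms. A minimal sketch of where these settings live; the application id, bootstrap server, and topic names are placeholders:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CommitIntervalDemo {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "commit-interval-demo"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Switching the guarantee changes the commit.interval.ms default:
        // at_least_once -> 30000 ms, exactly_once -> 100 ms.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                StreamsConfig.EXACTLY_ONCE_V2); // EXACTLY_ONCE on pre-3.0 clients

        // The default can also be overridden explicitly: shorter means lower
        // end-to-end latency under EOS, longer means higher throughput.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topics

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // On close(), Streams finishes in-flight records and commits right away,
        // as described in the thread above.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}

Per the discussion above, raising commit.interval.ms trades EOS latency (and, for at-least-once, the size of the duplicate window on failure) for higher throughput.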