Hi Alexander,

I’m sorry to hear that. It certainly sounds like a hard one to debug. 

To clarify, do you mean that when you observe this problem, the sink node is 
not in the topology at all, or that it is in the topology, but does not 
function properly?
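
One way to answer that, by the way, is to log the topology description
once it has been built and look for the "Sink:" line. Below, "topology"
stands for whatever Topology instance your application actually runs (I
believe Spring's StreamsBuilderFactoryBean can hand it to you, though I'm
not sure of the exact accessor):

    // Hypothetical helper: dump the built topology to stdout.
    static void logTopology(org.apache.kafka.streams.Topology topology) {
        // A complete sub-topology should include a line along the lines of:
        //   Sink: KTABLE-SINK-... (topic: <your output topic>)
        System.out.println(topology.describe());
    }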

Also, are you using Spring to construct the topology, or are you calling the 
Kafka Streams library directly to build the topology?

If the problem is that the sink node is missing completely, it’s hard to 
imagine how the problem could be in Streams itself: building the topology in 
Streams involves no connection to Kafka at all.

Then again, I’ve seen enough heisenbugs to know not to trust intuition too 
much. If you can try building the topology with the Streams builder directly, 
bypassing Spring, you can check whether the issue still reproduces.
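
To make that concrete, here is a minimal sketch of the kind of check I have
in mind. The topic names, serdes, and the count() aggregation are just
stand-ins for whatever your real processors do; the point is that the whole
thing builds and describes without a broker anywhere:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Produced;

    public class TopologySmokeTest {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Placeholder pipeline: read, aggregate, write the result out.
            builder
                .stream("input-topic",
                        Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count()
                .toStream()
                .to("output-topic",
                    Produced.with(Serdes.String(), Serdes.Long()));

            Topology topology = builder.build();

            // This runs entirely in-process; nothing here talks to Kafka.
            // The description should include a node like
            //   Sink: ... (topic: output-topic)
            // If the sink is always present when you build this way, but
            // missing when Spring builds the topology, that points away
            // from Streams itself.
            System.out.println(topology.describe());
        }
    }

If the sink always shows up there, the next thing I would look at is what
describe() prints for the Spring-built topology in the failing case.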

I hope this helps!
-John

On Tue, Nov 22, 2022, at 14:07, Alexander Kau wrote:
> My team is building a set of services that use Kafka Connect and Debezium
> to forward data changes from our Postgres database to Kafka, and then use
> Kafka Streams (via Spring Cloud Stream) to process this data and output an
> aggregate of the source data.
>
> We have been trying to track down an issue where the stream processors are
> not correctly configured when the application starts up before Kafka is up.
> Specifically, all of the processor nodes are correctly created except for
> the KTABLE-SINK-000000000# node. The result of this is that the services
> consume messages but do not forward their changes to their output topics.
> Therefore, data processing stops while the consumer offset continues to be
> incremented, so we lose messages and have to roll back the offsets and
> reprocess a large amount of data.
>
> This happens in both our Kubernetes environment and our Chef-managed
> environment. In the Chef environment, simply restarting the server is
> enough to trigger this issue, since Kafka takes longer to start up than the
> application. In Kubernetes, I can reliably trigger the issue by removing
> the application's dependency on Kafka, stopping Kafka, restarting the
> application, and then starting Kafka.
>
> I have tested with Kafka 3.0.0 and 3.3.1, and the behavior does not change.
> We are using the latest Spring dependencies (Spring Cloud 2021.0.5, Spring
> Cloud Stream 3.2.6).
>
> This may be a "heisenbug": a bug that only occurs when it is not being
> observed. I spent most of yesterday debugging the Kafka setup code during
> application startup with and without the Kafka broker running, and was
> unable to reproduce the bug while doing so, but was able to consistently
> reproduce it as soon as I stopped debugging the startup process. I suspect
> that this may mean that this is a race condition in the parts of the
> application that only run after connecting to the Kafka broker, but I'm not
> familiar enough with the Kafka source to go much further on this.
>
> Although I understand that this is a bit of an edge case (the Kafka broker
> should generally not go down), the results here involve missing/invalid
> data, and it is not possible to confirm whether the application is in this
> state except by either confirming that "this consumer consumed a message but
> didn't forward it to its output topic" or by hooking up a debugger and
> inspecting the ProcessorContext, so we can't reasonably implement a health
> check to verify whether the application is in the bad state and restart it.
> (Even if we could check for this state, there's no guarantee that it didn't
> lose some messages while it was in the bad state.)
>
> I've done a fair amount of searching for this sort of issue, but have been
> unable to find any other people who I can confirm to have the same issue. I
> am not certain whether this is an issue in Kafka itself or in Spring Cloud
> Stream.
>
> Any guidance or suggestions would be appreciated.
