Hi Divij,

I've worked on several projects that had a "debug mode." It was something that 
a lot of old-fashioned C and C++ projects would do, usually implemented through 
an ASSERT macro or similar that was defined away when in "production mode."

I didn't like this back then, and I still don't. If the assertion isn't 
expensive, you should just do it all the time. If the assertion is expensive, 
then you should do it in a test rather than when running, because an expensive 
operation will change the timings of a distributed system and make your "debug 
mode server" perform quite differently from the "real production server."
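
To make the "just do it all the time" case concrete, here is a minimal sketch 
of a cheap invariant that is checked on every call rather than behind a debug 
switch. OffsetTracker and its methods are hypothetical names I made up for 
illustration; this is not existing Kafka code.

```java
// Hypothetical example class, not part of Kafka.
class OffsetTracker {
    private long lastOffset = -1L;

    // Cheap O(1) invariant: offsets must strictly increase.
    // The check runs unconditionally, so test and production behave the same.
    public void record(long offset) {
        if (offset <= lastOffset) {
            throw new IllegalStateException(
                "offset " + offset + " is not greater than last offset " + lastOffset);
        }
        lastOffset = offset;
    }

    public long lastOffset() {
        return lastOffset;
    }
}
```

An expensive check (say, rescanning a whole log segment) would instead live in 
a unit or system test that calls into the same code.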

Another issue is that, based on my experience, people often did stuff in the 
assert blocks that would change other things in the system. Since code in C/C++ 
(and also Java) can have side effects, it's easy to accidentally change things 
with your verification code.
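
As a small illustration of the side-effect problem in Java: the assert keyword 
is skipped unless the JVM is started with -ea, so anything the asserted 
expression changes only happens in "debug mode." The class and method names 
below are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical example class, not part of Kafka.
class AssertSideEffect {
    // Buggy: poll() removes an element, but only when assertions are enabled.
    // With assertions disabled (the JVM default) the queue is left untouched.
    public static int sizeAfterBuggyCheck(Deque<String> pending) {
        assert pending.poll() != null : "expected pending work";
        return pending.size();
    }

    // Safer: the check only reads state, so both modes behave the same.
    public static int sizeAfterSafeCheck(Deque<String> pending) {
        assert !pending.isEmpty() : "expected pending work";
        return pending.size();
    }

    // Common idiom for detecting whether assertions are enabled.
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // intentional assignment inside the assert
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println("buggy check leaves: "
            + sizeAfterBuggyCheck(new ArrayDeque<>(List.of("x", "y"))));
        System.out.println("safe check leaves:  "
            + sizeAfterSafeCheck(new ArrayDeque<>(List.of("x", "y"))));
    }
}
```

Running this once with -ea and once without should show the buggy variant 
leaving a different number of elements in the queue, while the safe variant 
does not change.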

Concretely, it sounds like you hit a race condition with the non-thread-safe 
buffer pool code. It would be good to think about how we could avoid this in 
the future, but I don't think "debug mode" is the answer. Instead, it might be 
better to take another look at how we're doing buffer pooling to see if we can 
simplify. Why are we passing a non-thread-safe object between threads in the 
first place? Should this be documented better, or better yet, avoided? Why not 
use a thread-local instead to make this all so much simpler? etc.
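
A rough sketch of the thread-local idea, just to show the shape of it. This is 
not how the current buffer pool code works; ThreadLocalBuffers, its get() 
method, and the 1024-byte initial capacity are all assumptions made for 
illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical example class, not the existing Kafka buffer pool.
class ThreadLocalBuffers {
    // Assumed starting size; a real value would depend on the workload.
    private static final int INITIAL_CAPACITY = 1024;

    // Each thread lazily gets its own buffer, so no buffer is ever shared.
    private static final ThreadLocal<ByteBuffer> BUFFER =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(INITIAL_CAPACITY));

    // Returns the calling thread's buffer, cleared and at least minCapacity bytes.
    public static ByteBuffer get(int minCapacity) {
        ByteBuffer buf = BUFFER.get();
        if (buf.capacity() < minCapacity) {
            // Grow by replacing this thread's buffer with a larger one.
            buf = ByteBuffer.allocate(minCapacity);
            BUFFER.set(buf);
        }
        buf.clear();
        return buf;
    }
}
```

The trade-off is that the buffer must not escape the thread that obtained it, 
and memory use scales with the number of threads that ever call get().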

best,
Colin

On Tue, Oct 24, 2023, at 02:32, Divij Vaidya wrote:
> Hey folks
>
> We recently came across a bug [1] which was very hard to detect during
> testing and easy to introduce during development. I would like to kick
> start a discussion on potential ways which could avoid this category of
> bugs in Apache Kafka.
>
> I think we might want to start working towards a "debug" mode in the broker
> which will enable assertions for different invariants in Kafka. Invariants
> could be derived from formal verification that Jack [2] and others have
> shared with the community earlier AND from tribal knowledge in the
> community such as network threads should not perform any storage IO, files
> should not fsync in critical product path, metric gauges should not acquire
> a lock etc. The release qualification  process (system tests + integration
> tests) will run the broker in "debug" mode and will validate these
> assertions while testing the system in different scenarios. The inspiration
> for this idea is derived from Marc Brooker's post at
> https://brooker.co.za/blog/2023/07/28/ds-testing.html
>
> Your thoughts on this topic are welcome! Also, please feel free to take
> this idea forward and draft a KIP for a more formal discussion.
>
> [1] https://issues.apache.org/jira/browse/KAFKA-15653
> [2] https://lists.apache.org/thread/pfrkk0yb394l5qp8h5mv9vwthx15084j
>
> --
> Divij Vaidya
