Kafka looks like an exciting project, thanks for opening it up. I have a few questions:
1. Are checksums end to end (i.e., created by the producer and verified by the consumer), or are they only used to confirm buffer-cache behavior on disk, as mentioned in the documentation? Bit errors occur far more often than most people assume, often because of device driver bugs, and TCP's 16-bit checksum fails to detect roughly 1 in 65,536 corrupted segments, so errors can flow through. (If you like, I can send links to papers describing the need for checksums everywhere.)

2. The consumer has a pretty solid mechanism to ensure it hasn't missed any messages (I like the design, by the way), but how does the producer know that all of its messages have been stored? There is no apparent message id on that side, since the message id isn't known until the message is written to the file. I'm especially curious how failover/replication could be implemented, and I'm thinking that acks on the publisher side may help.

3. Has the consumer's flow control been tested over high bandwidth*delay links? (What bandwidth can a London consumer get from an SF cluster?)

4. What kind of performance do you get if you set the producer's message delay to zero? That is, is there a separate system call for each message, or do you manage to aggregate messages into a smaller number of system calls even with a delay of 0?

5. Have you considered using a library like ZeroMQ for the messaging layer instead of rolling your own? ZeroMQ handles the aggregation in #4 cleanly at millions of messages per second and has clients in 20 languages.

6. Do you have any plans to support intermediate processing elements the way Flume does?

7. The docs mention that new versions will only be released after they are in production at LinkedIn. Does that mean that the latest version of the source code is hidden at LinkedIn, and contributors would have to throw patches over the wall and wait months to get the integrated product?

Thanks!
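P.S. To make question 1 concrete, here is a minimal sketch of what I mean by an end-to-end checksum: the producer computes a CRC32 over the payload and the consumer re-verifies it after the bytes have crossed the network and disk. The 4-byte framing below is purely illustrative, not Kafka's actual wire format.

```python
import struct
import zlib


def produce(payload: bytes) -> bytes:
    """Producer side: prepend a CRC32 computed over the payload.
    (Hypothetical 4-byte big-endian framing, for illustration only.)"""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">I", crc) + payload


def consume(message: bytes) -> bytes:
    """Consumer side: recompute the CRC32 and reject corrupted messages,
    catching bit flips introduced by drivers, disks, or the network."""
    (crc,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return payload


msg = produce(b"hello kafka")
assert consume(msg) == b"hello kafka"

# A single flipped bit anywhere along the path is detected end to end:
corrupted = msg[:-1] + bytes([msg[-1] ^ 0x01])
try:
    consume(corrupted)
    detected = False
except ValueError:
    detected = True
assert detected
```

The point is that the same checksum travels with the message from producer to consumer, so a broker- or kernel-side verification alone would not cover the full path.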
