Hey, thanks! Responses inline.

> 1. Please, please, open tickets for anything you need. Your pain on the
> offset management is something I've been mulling over. I don't have a
> really good idea on how to solve this, but I agree we should figure it out.
Yes, absolutely. I'd been holding off opening them until I had a better idea of exactly what I was missing, but you can expect a ticket or two soon.

> 2. "But if our task is the only producer for that partition, it's easy to
> calculate the value of n... and by starting from the commit, and then
> dropping the first n messages of output we create, we can get ourselves
> back to a consistent state." This is true only in cases where the
> processing is deterministic. There are a shocking amount of cases where
> this is not true, including some which are very subtle. If you rely on
> wall-clock time, for example, your logic will change between processing,
> so simply dropping N messages could lead to data loss or duplication.
> Other, more nuanced, non-determinism includes deploying a different
> (newer) version of code, dependency on external systems, etc.

You're completely right; determinism is a huge consideration here. I had a note on that in an earlier draft, but I seem to have edited it out completely -- thanks for pointing this out. The short version: if you can have zero or more outputs for every input, it's definitely possible to get these weird results. If there's precisely one output for every input message, though, you typically *don't* have the same issues. It's also frequently possible to split a nondeterministic job that has multiple outputs into two: one nondeterministic half that batches its outputs as a single message, and one deterministic one that splits that message up into multiple outputs for some downstream job. (I've tacked a couple of very rough sketches of both pieces onto the end of this mail.) Again, there's a lot to say here... I'll save it for the next post.

> 3. You might want to consider the case where a downstream consumer is
> reading from a changelog, as well. This is a use case we've seen pop up
> once in a while.

This is a great optimization -- but as far as coordination goes, I think it's similar to the general state/changelog situation. (Though you do get to skip the downstream offset.)

> 4. One variation of merge log that we'd considered was to batch reads from
> an SSP together. This would allow you to shrink your merge log, so you can
> say, "Read 100 messages from SSP1, 100 messages from SSP2, etc." This
> allows you to drastically shrink the merge log's message count, which
> should improve performance. Samza's DefaultMessageChooser supports this
> style of batching, but it's off by default.

I've thought a bit about this; it does indeed save a lot of messages (though repeating the same topic names over and over does gzip quite well). I'm not sure how well it fits behind the existing MessageChooser interface, though -- you need to know that there are 100 messages coming for that particular stream, but the message chooser only sees them one at a time.

> 5. You should have a look at Kafka's transactionality proposal, as well
> (https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+
> in+Kafka). This is the only way that I'm aware of that will support all
> use cases (including non-deterministic ones).

The transactional-messaging work is very cool, but it raises some complexity / scalability flags for me. (All transactions serialized through a single coordinator, unbounded buffering in the client, etc.) It does still seem like a useful pattern, but I do think there's still space for these other tools. I'd love to see a project that does for Kafka what Curator does for Zookeeper -- packaging up some of these patterns as a library and decoupling them from the core project. But that's another conversation...
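P.S. The promised sketches, in Samza StreamTask terms. First, the "drop the first n outputs" idea as a wrapper around a deterministic task. This is just an illustration, not anything in the post or in Samza: the class name is invented, and the already-written count is passed in directly, whereas in practice you'd derive it by comparing the committed input offset against what's already in the output partition.

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Hypothetical wrapper: on restart, replay the wrapped task from the last
    // committed input offset, but suppress the first `toSkip` outputs, since
    // those already made it into the output partition before the failure.
    // Only safe if the wrapped task is deterministic.
    public class DropAlreadyWrittenOutputsTask implements StreamTask {
      private final StreamTask inner;
      private long toSkip; // outputs already present downstream

      public DropAlreadyWrittenOutputsTask(StreamTask inner, long alreadyWritten) {
        this.inner = inner;
        this.toSkip = alreadyWritten;
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          final MessageCollector collector,
                          TaskCoordinator coordinator) throws Exception {
        // Wrap the collector so sends are counted and the first N are dropped.
        MessageCollector deduping = new MessageCollector() {
          @Override
          public void send(OutgoingMessageEnvelope out) {
            if (toSkip > 0) {
              toSkip--;            // already written before the failure
            } else {
              collector.send(out); // genuinely new output
            }
          }
        };
        inner.process(envelope, deduping, coordinator);
      }
    }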
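And here's roughly what I mean by the deterministic half of the split-in-two job: the nondeterministic upstream task writes each batch of outputs as a single message, and this task just fans the batch out. Again, the class name, the output stream name, and the assumption that the batch arrives as a List are all made up for the sake of the example.

    import java.util.List;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Deterministic "splitter" half: one batch message in, batch.size()
    // messages out, and that count is recoverable from the input alone --
    // which is what makes replay-and-drop well-defined downstream.
    public class SplitBatchTask implements StreamTask {
      // Hypothetical output stream, just for illustration.
      private static final SystemStream OUTPUT = new SystemStream("kafka", "events-split");

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        @SuppressWarnings("unchecked")
        List<Object> batch = (List<Object>) envelope.getMessage();
        for (Object event : batch) {
          collector.send(new OutgoingMessageEnvelope(OUTPUT, event));
        }
      }
    }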
--
Ben Kirwin
http://ben.kirw.in/
