Hi,

I’m trying to “enrich” a stream of events (i.e. roughly speaking read messages 
from one topic, make a query to an external system, write to another topic).  
The problem is – external system can handle lots of calls, but has a high 
latency. 
I can easily do the enrichment if I use one thread per message, but that's 
kinda' wasteful (I'd need a lot of partitions/ tasks to have reasonable 
throughput).... so, I'm thinking about using multiplexed IO. Like, have a 
processor that just registers the "input message"  in a local state store, 
without committing the task; and then, the punctuator can look at the 
registered "input messages", start the IO for all of them, forward the results 
for the completed IO tasks to subsequent processors, and commit progress. The 
state store can be non-replicated, since I can reprocess messages in case of 
failure (and I don't really mind duplication of messages on the output topic)

The question is... would that work? I'm concerned about rebalancing (when one 
worker is added/removed), and specifically:
- If I understand correctly the code, during partition rebalance the tasks may 
be suspended, and the suspend() method will actually commit the offsets (the 
last task where process() was completed, regardless whether it invoked commit() 
or not). That'd be bad, since it means that on rebalance, I might end up 
skipping records (I'm not concerned about duplication, within reasonable 
bounds; but I am concerned about skipping records)
- Not sure what happens to the state during rebalancing (if I disable the 
changelog - i.e. I make all the state local). Is all state lost? 

Thanks,
Virgil.

Reply via email to