Hey folks,

I have a few questions about how stateful processing behaves while scaling nodes up/down, and a possible KIP in question #4. Most of them concern task processing while state stores are being rebuilt after scaling up.
Scenario: a single node/thread processing 2 topics (10 partitions each):

  - User event topic (events), i.e. key: userId, value: ProductId
  - Product topic (entity), i.e. key: ProductId, value: productData

My topology looks like this:

  KTable productTable = ...           // materialized from the product topic
  KStream output = userStream
      .map(x => (x.value, x.key))     // swap the key and value
      .join(productTable, ...)        // joiner is not relevant here
      .to(...)                        // send it to some output topic

Here are my questions:

1) If I scale the processing node count up, partitions will be rebalanced to the new node. Does processing continue as normal on the original node while the new node's processing is paused until its internal state stores are rebuilt/reloaded? From my reading of the code (and my own experience) I believe this to be the case, but I am asking in case I missed something.

2) What happens to the userStream map task? Will the new node be able to process this task while the state store is rebuilding/reloading? My reading of the code suggests that the map step will be paused on the new node while the state store is rebuilt. As a result, events will be delayed in reaching the original node's partitions, where they will be seen as late-arriving events. Am I right in this assessment?

3) How does scaling up work with standby state-store replicas? From my reading of the code, it appears that scaling a node up results in a rebalance, with the state assigned to the new node being rebuilt first (leading to a pause in processing). Only after that are the standby replicas populated. Am I correct in this reading?

4) If my reading in #3 is correct, would it be possible to pre-populate the standby stores on scale-up before initiating the active-task transfer? This would allow seamless scale-up and scale-down without requiring any pauses for rebuilding state.
I am interested in kicking this off as a KIP if so, but would appreciate pointers to any JIRAs or related KIPs to read up on before digging into this.

Thanks,
Adam Bellemare