Hi,

Are there any typical patterns for deploying stateful streaming Beam pipelines? We are targeting Dataflow with the Python SDK and make significant use of stateful processing with long windows (typically days long).
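
To make the setup concrete, here is a minimal, hypothetical sketch of the kind of stateful DoFn I have in mind (simplified, not our actual code; names are illustrative):

    import apache_beam as beam
    from apache_beam.transforms.userstate import CombiningValueStateSpec

    class RunningCount(beam.DoFn):
        # Per-key running count kept in user state; in our real pipelines
        # this state lives under windows that are typically days long.
        COUNT = CombiningValueStateSpec('count', sum)

        def process(self, element, count=beam.DoFn.StateParam(COUNT)):
            key, _value = element  # stateful DoFns require keyed input
            count.add(1)
            yield key, count.read()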

From our current practice of maintaining pipelines we have identified 3 typical scenarios:

1. Deployment without breaking changes (no changes to the pipeline graph, coders, state, outputs, etc.) - just update the Dataflow job in place (a sketch of the update options follows this list).

2. Deployment with changes to internal state (changes to coders, state, or even the pipeline graph, but without changing the pipeline input/output schemas) - in this case updating the job in place would not work, as the state has changed and reading state saved by the old pipeline would result in an error.

3. Deployment with changes to the output schema (and potentially to internal state too) - here we need to take special care when changing the output schema so that downstream processes have time to switch from the old version of the data to the new one.
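
For reference, by "update in place" in scenario 1 I mean the standard Dataflow update flow. A simplified, hypothetical sketch of the options passed (project, region, and job names are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical options for an in-place update (scenario 1). job_name must
    # match the running job; transform_name_mapping is only needed when
    # transform names changed between the old and new pipeline graph.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',             # placeholder
        region='europe-west1',            # placeholder
        streaming=True,
        job_name='my-stateful-pipeline',  # placeholder, name of the running job
        update=True,
        transform_name_mapping={'OldStepName': 'NewStepName'},
    )

    # with beam.Pipeline(options=options) as p:
    #     ...  # same pipeline construction code as the running job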

To be specific, I need some advice/patterns/knowledge on points 2 and 3. I guess they will require spinning up new pipelines with data back-filling or migration jobs? I would really appreciate detailed examples of how you deal with deploying similar stateful streaming pipelines - ideally with details on how much data to reprocess to populate internal state, what needs to be done when changing the output schema of a pipeline, and how to orchestrate all these activities.

Best
Wisniowski Piotr