Hi,
Are there any typical patterns for deploying stateful streaming Beam
pipelines? We target Dataflow with the Python SDK and make heavy use of
stateful processing with long windows (typically days long).
From our current practice of maintaining pipelines we have identified
three typical scenarios:
1. Deployment without breaking changes (no changes to the pipeline
graph, coders, state, outputs, etc.) - we just update the Dataflow job
in place.
2. Deployment with changes to internal state (changes to coders, state,
or even the pipeline graph, but without changing the pipeline's
input/output schemas) - in this case updating the job in place does not
work, since the state format has changed and reading state saved by the
old pipeline results in an error.
3. Deployment with changes to the output schema (and potentially to
internal state too) - here we need to take special care when changing
the output schema, so that downstream consumers have time to switch
from the old version of the data to the new one.
To be specific, I am looking for advice/patterns/knowledge on p.2 and
p.3. I assume they require spinning up new pipelines with data
backfilling or migration jobs? I would really appreciate detailed
examples of how you deal with deploying similar stateful streaming
pipelines - ideally with details on how much data to reprocess to
populate internal state, what needs to be done when changing a
pipeline's output schema, and how to orchestrate all of these
activities.
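To make p.3 concrete, the kind of transition we have in mind is a
dual-write phase: every result is published in both the old and the new
output schema until downstream consumers have switched. A sketch of the
shape of that shim (the field names, "v1"/"v2" schemas, and sinks are
all hypothetical):

```python
# Illustrative dual-write shim for an output-schema migration: the same
# logical result is emitted to both the old-schema and new-schema sinks.

def to_v1(result):
    # Old schema: flat record.
    return {"user": result["user_id"], "total": result["total"]}

def to_v2(result):
    # New schema: nested user info plus an explicit schema_version field.
    return {
        "schema_version": 2,
        "user": {"id": result["user_id"]},
        "total": result["total"],
    }

def dual_write(result, sink_v1, sink_v2):
    """During migration, emit each result in both schemas."""
    sink_v1.append(to_v1(result))
    sink_v2.append(to_v2(result))
```

In Beam this would presumably be a DoFn with two tagged outputs feeding
two sinks, with the v1 output dropped once downstream has migrated - but
again, I would love to hear how others actually orchestrate this.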
Best
Wisniowski Piotr