Hi,
Are there any typical patterns for deploying stateful streaming Beam
pipelines? We target Dataflow with the Python SDK and make heavy use of
stateful processing with long windows (typically days long).
From our current practice of maintaining pipelines we have identified
three typical scenarios:
1. Deployment without breaking changes (no changes to the pipeline
graph, coders, state, outputs, etc.) - we just update the Dataflow job
in place.
2. Deployment with changes to internal state (changes to coders, state,
or even the pipeline graph, but without changing the pipeline's
input/output schemas) - in this case updating the job in place does not
work, since the state format has changed and reading state saved by the
old pipeline results in an error.
3. Deployment with changes to the output schema (and potentially to
internal state too) - here we need to take special care when changing
the output schema, so that downstream consumers have time to switch
from the old version of the data to the new one.
To be specific, I am looking for advice/patterns/knowledge on p.2 and
p.3. I assume they require spinning up new pipelines with data
backfilling or migration jobs? I would really appreciate detailed
examples of how you deal with deploying similar stateful streaming
pipelines - ideally with details on how much data to reprocess to
populate internal state, what needs to be done when changing a
pipeline's output schema, and how to orchestrate all of these
activities.
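To make p.3 concrete, the kind of transition we have in mind is a
dual-write phase: every result is published in both the old and the new
output schema until downstream consumers have switched. A sketch of the
shape of that shim (the field names, "v1"/"v2" schemas, and sinks are
all hypothetical):

```python
# Illustrative dual-write shim for an output-schema migration: the same
# logical result is emitted to both the old-schema and new-schema sinks.

def to_v1(result):
    # Old schema: flat record.
    return {"user": result["user_id"], "total": result["total"]}

def to_v2(result):
    # New schema: nested user info plus an explicit schema_version field.
    return {
        "schema_version": 2,
        "user": {"id": result["user_id"]},
        "total": result["total"],
    }

def dual_write(result, sink_v1, sink_v2):
    """During migration, emit each result in both schemas."""
    sink_v1.append(to_v1(result))
    sink_v2.append(to_v2(result))
```

In Beam this would presumably be a DoFn with two tagged outputs feeding
two sinks, with the v1 output dropped once downstream has migrated - but
again, I would love to hear how others actually orchestrate this.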
Best
Wisniowski Piotr