Hi,
We are investigating an issue with our Python SDK streaming pipelines and have a few questions, but first some context.
Our stack:
- Python SDK 2.54.0 (we also tried 2.55.1).
- Dataflow Streaming Engine with the SDK in a container image (we also tried Prime). Currently our pipelines have low enough traffic that a single node handles it most of the time, but occasionally we do scale up.
- Deployment via the Terraform `google_dataflow_flex_template_job` resource, which normally performs a job update when Terraform is re-applied.
- We make heavy use of `ReadModifyWriteStateSpec`, other state specs and watermark timers, but we keep the size of the state under control.
- We use custom coders for Pydantic models serialized as Avro (see the sketch below).
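
For context, the way we combine state, watermark timers and the custom coder looks roughly like the sketch below. This is a simplified, illustrative sketch only: `EventModel`, `PydanticAvroCoder` and `LatestPerKeyDoFn` are placeholder names, and the placeholder coder serializes to JSON rather than Avro to keep the example short.
```python
# Minimal, illustrative sketch only -- names are placeholders, and the
# placeholder coder uses JSON instead of our real Avro serialization.
import apache_beam as beam
from apache_beam.coders import Coder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec,
    TimerSpec,
    on_timer,
)
from pydantic import BaseModel


class EventModel(BaseModel):
    key: str
    value: int


class PydanticAvroCoder(Coder):
    """Placeholder coder for Pydantic models (our real coder uses Avro)."""

    def encode(self, value: EventModel) -> bytes:
        return value.model_dump_json().encode("utf-8")

    def decode(self, encoded: bytes) -> EventModel:
        return EventModel.model_validate_json(encoded.decode("utf-8"))

    def is_deterministic(self) -> bool:
        return True


class LatestPerKeyDoFn(beam.DoFn):
    # Keyed state holding the latest event per key, plus a watermark timer
    # that flushes it once the watermark passes element timestamp + 60s.
    LATEST = ReadModifyWriteStateSpec("latest", PydanticAvroCoder())
    FLUSH = TimerSpec("flush", TimeDomain.WATERMARK)

    def process(
        self,
        element,
        timestamp=beam.DoFn.TimestampParam,
        latest=beam.DoFn.StateParam(LATEST),
        flush=beam.DoFn.TimerParam(FLUSH),
    ):
        _key, event = element
        latest.write(event)        # keep only the most recent event per key
        flush.set(timestamp + 60)  # re-arm the watermark timer

    @on_timer(FLUSH)
    def on_flush(self, latest=beam.DoFn.StateParam(LATEST)):
        event = latest.read()
        if event is not None:
            yield event
            latest.clear()
```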
The issue:
- Occasionally watermark progression stops. The issue is not deterministic and happens about 1-2 times per day across a few pipelines.
- No user code errors are reported, but we do get errors like these:
```
INTERNAL: The work item requesting state read is no longer valid on the backend. The work has already completed or will be retried. This is expected during autoscaling events. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']
```
```
ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the process running the pipeline code has crashed. Inspect the Worker Logs and the Diagnostics tab to determine the cause of the crash. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } } [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999 [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']
```
```
Work item for sharding key 8dd4578b4f280f5d tokens (1316764909133315359, 17766288489530478880) encountered error during processing, will be retried (possibly on another worker): generic::internal: Error encountered with the status channel: SDK harness sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK disconnect.
```
```
Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation fault (core dumped) restarting SDK process
```
- We did manage to correlate this with either a vertical autoscaling event (when using Prime) or other worker replacements done by Dataflow under the hood, but this is not deterministic.
- For a few hours watermark progress stops, while other workers continue to process messages.
- And after a few hours:
```
Error message from worker: generic::internal: Error encountered with the status channel: There are 10 consecutive failures obtaining SDK worker status info from sdk-0-0. The last success response was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be permanently unresponsive. Aborting the SDK. For more information, see: https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
```
- After that, the pipeline starts to catch up and the watermark progresses again.
- A job update via Terraform apply also fixes the issue.
- We do not see any excessive use of worker memory or disk, and CPU utilization is close to idle most of the time. I do not think we use any C/C++ code with Python, nor any parallelism/threads outside of Beam's parallelization.
Questions:
1. What could be potential causes of such behavior? How can we get more insight into this problem?
2. I have seen `In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work` listed as a known issue in the Beam release notes. What is the status of this? Could it be related to what we are seeing?
We would really appreciate any help, clues or hints on how to debug this issue.
Best regards
Wiśniowski Piotr
