Hi!

Thank you for the hint. We will try the mitigation from the issue. We had already tried everything from https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact, but let's hope upgrading the dependency will help. I will reply to this thread once I get confirmation.
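
As a side note for anyone following along, below is a minimal sketch (the package names are only assumptions, since the exact dependency isn't named here) of how one could log the versions actually installed in the worker container, to confirm that an upgrade really made it into the image:

```
# Illustrative sketch, not tied to a specific mitigation: log the versions of
# a few likely dependencies from inside the worker so we can confirm the
# upgraded package actually made it into the container image.
import logging
from importlib import metadata


def log_dependency_versions(packages=("apache-beam", "grpcio", "protobuf")):
    # The package names above are assumptions; swap in whichever dependency
    # the mitigation actually upgrades.
    for pkg in packages:
        try:
            logging.info("worker has %s==%s", pkg, metadata.version(pkg))
        except metadata.PackageNotFoundError:
            logging.info("worker does not have %s installed", pkg)
```

Calling something like this from a DoFn's `setup()` makes the versions show up in the Dataflow worker logs.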

BTW, great job on the investigation of the bug that you mentioned. Impressive. Seems like a nasty one.

Best,

Wiśniowski Piotr

On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:
You might be running into https://github.com/apache/beam/issues/30867.

Among the error messages you mentioned, the following one is closest to the root cause: ```Error message from worker: generic::internal: Error encountered with the status channel: There are 10 consecutive failures obtaining SDK worker status info from sdk-0-0. The last success response was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be permanently unresponsive. Aborting the SDK. For more information, see: https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```

If the mitigations in https://github.com/apache/beam/issues/30867 don't resolve your issue, please see https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact for instructions on how to find what causes the workers to be stuck.
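
Independent of that page, one generic way to see where a stuck Python SDK harness is blocked is to have the worker periodically dump the stack traces of all of its threads. A minimal sketch using only the standard-library `faulthandler` module (the 300-second interval and the pass-through DoFn are arbitrary choices for illustration):

```
# Illustrative sketch only: make a worker leave stack traces in its logs.
import faulthandler
import sys

import apache_beam as beam


class StackDumpingDoFn(beam.DoFn):
    """Pass-through DoFn that turns on periodic stack dumps in its worker."""

    def setup(self):
        # Print a Python traceback if the process dies from a fatal signal
        # such as SIGSEGV.
        faulthandler.enable(file=sys.stderr)
        # Every 300 seconds, dump the stacks of all threads to stderr;
        # repeat=True keeps doing so, which is harmless on a healthy worker.
        faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

    def process(self, element):
        yield element
```

Wiring that DoFn into the pipeline (or calling the same two `faulthandler` lines from any existing `setup()`) should make the periodic dumps and any crash traceback show up in the worker logs, which may narrow down where sdk-0-0 is stuck.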

Thanks!

On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr <contact.wisniowskipi...@gmail.com> wrote:

    Hi,
    We are investigating an issue with our Python SDK streaming
    pipelines and have a few questions, but first some context.
    Our stack:
    - Python SDK 2.54.0, but we also tried 2.55.1
    - Dataflow Streaming Engine with the SDK in a container image (we
    also tried Prime)
    - Currently our pipelines have low enough traffic that a single
    node handles it most of the time, but occasionally we do scale up.
    - Deployment via the Terraform `google_dataflow_flex_template_job`
    resource, which normally does a job update when re-applying Terraform.
    - We use `ReadModifyWriteStateSpec`, other states and watermark
    timers a lot, but we do keep the size of the state under control
    (a sketch follows after this list).
    - We use custom coders for Pydantic models serialized as Avro
    (also sketched below).
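    To make that concrete, here is a minimal sketch of the kind of stateful
    DoFn we mean; the names, the int coder and the one-hour expiry are made
    up for illustration, not our actual code. It combines
    `ReadModifyWriteStateSpec` with a watermark timer that clears the state
    so it stays bounded:

```
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)
from apache_beam.utils.timestamp import Duration


class TrackLastValue(beam.DoFn):
    # Keyed state holding the last value seen for the key.
    LAST_VALUE = ReadModifyWriteStateSpec('last_value', VarIntCoder())
    # Event-time (watermark) timer used to clear the state later.
    EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)

    def process(
            self,
            element,
            timestamp=beam.DoFn.TimestampParam,
            last_value=beam.DoFn.StateParam(LAST_VALUE),
            expiry=beam.DoFn.TimerParam(EXPIRY)):
        key, value = element
        last_value.write(value)
        # Fire roughly an hour after this element's event time.
        expiry.set(timestamp + Duration(seconds=3600))
        yield key, value

    @on_timer(EXPIRY)
    def expire(self, last_value=beam.DoFn.StateParam(LAST_VALUE)):
        # Dropping state when it is no longer needed keeps its size bounded.
        last_value.clear()
```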
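    Similarly, a hedged sketch of what one of those custom coders looks like
    in principle; the real ones serialize Pydantic models to Avro, but JSON
    and a plain class stand in here so the example stays self-contained:

```
import json

from apache_beam.coders import Coder, registry


class ExampleRecord:
    """Stand-in for one of the Pydantic models (hypothetical)."""

    def __init__(self, user_id, amount):
        self.user_id = user_id
        self.amount = amount


class ExampleRecordCoder(Coder):
    """Illustrative deterministic coder for ExampleRecord."""

    def encode(self, value):
        # The real coders produce Avro bytes; JSON with sorted keys stands
        # in here to keep the encoding deterministic and self-contained.
        return json.dumps(
            {'user_id': value.user_id, 'amount': value.amount},
            sort_keys=True).encode('utf-8')

    def decode(self, encoded):
        data = json.loads(encoded.decode('utf-8'))
        return ExampleRecord(data['user_id'], data['amount'])

    def is_deterministic(self):
        # Must be true for types used as keys or kept in state.
        return True


# Tell Beam to use this coder whenever an ExampleRecord flows through.
registry.register_coder(ExampleRecord, ExampleRecordCoder)
```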
    The issue:
    - Occasionally watermark progression stops. The issue is not
    deterministic and happens about 1-2 times per day across a few pipelines.
    - No user code errors are reported, but we do get errors like this:
    ```INTERNAL: The work item requesting state read is no longer
    valid on the backend. The work has already completed or will be
    retried. This is expected during autoscaling events.
    [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
    { trail_point { source_file_loc { filepath:
    "dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } }
    }']```
    ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means
    that the process running the pipeline code has crashed. Inspect
    the Worker Logs and the Diagnostics tab to determine the cause of
    the crash.
    [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
    { trail_point { source_file_loc { filepath:
    "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217
    } } } [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
    [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT
    }']```
    ```Work item for sharding key 8dd4578b4f280f5d tokens
    (1316764909133315359, 17766288489530478880) encountered error
    during processing, will be retried (possibly on another worker):
    generic::internal: Error encountered with the status channel: SDK
    harness sdk-0-0 disconnected. with MessageCode:
    (93f1db2f7a4a325c): SDK disconnect.```
    ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal:
    segmentation fault (core dumped) restarting SDK process```
    - We did manage to correlate this with either a vertical
    autoscaling event (when using Prime) or other worker replacements
    done by Dataflow under the hood, but this is not deterministic.
    - For a few hours watermark progression stops, while other workers
    do process messages.
    - And after a few hours:
    ```Error message from worker: generic::internal: Error encountered
    with the status channel: There are 10 consecutive failures
    obtaining SDK worker status info from sdk-0-0. The last success
    response was received 3h20m2.648304212s ago at
    2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be
    permanently unresponsive. Aborting the SDK. For more information,
    see:
    
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
    - And the pipeline starts to catch up and the watermark progresses again.
    - Job update by Terraform apply also fixes the issue.
    - We do not see any excessive use of worker memory or disk. CPU
    utilization is also close to idle most of the time. I do not think
    we use any C/C++ code with Python, nor any parallelism/threads
    outside of Beam's parallelization.
    Questions:
    1. What could be the potential causes of such behavior? How can we
    get more insight into this problem?
    2. I have seen `In Python pipelines, when shutting down inactive
    bundle processors, shutdown logic can overaggressively hold the
    lock, blocking acceptance of new work` listed as a known issue in
    the Beam release notes. What is the status of this? Could it
    potentially be related?
    Really appreciate any help, clues or hints on how to debug this issue.
    Best regards
    Wiśniowski Piotr
