Re: Watermark progress halt in Python streaming pipelines

2024-04-24 Thread Valentyn Tymofieiev via user
On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr <
contact.wisniowskipi...@gmail.com> wrote:

> Hi!
>
> Thank you for the hint. We will try the mitigation from the issue. We
> already tried everything from
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
> , but let's hope upgrading the dependency will help. Will reply to this
> thread once I get confirmation.
>
> BTW, great job on the investigation of the bug that you mentioned. Impressive.
> Seems like a nasty one.
>
Thanks. I was specifically recommending that you check the recently added
content under "It might be possible to retrieve stacktraces of a thread
that is holding the GIL on a running Dataflow worker as follows:", as that
should help you find out what is causing the stuckness in your case. But
hopefully it won't be necessary after you adjust the grpcio version.
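
If it does become necessary and the exact steps in that section are hard to
apply in your setup, a generic fallback is to have the harness periodically
dump the stacks of all Python threads via faulthandler, which writes frames
from C and therefore works even when another thread is holding the GIL. A
minimal sketch, assuming you can add it to your pipeline code; the DoFn
placement and the interval are illustrative, this is not the procedure from
the docs:

```python
import faulthandler
import sys

import apache_beam as beam


class StackDumpingDoFn(beam.DoFn):
    """Illustrative wrapper that periodically dumps all Python thread stacks."""

    def setup(self):
        # Dump every thread's traceback to stderr every 300 seconds until the
        # process exits; the output shows up in the worker logs. faulthandler
        # does not need the GIL, so it still works when Python code is stuck.
        faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

    def process(self, element):
        yield element
```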



> Best,
>
> Wiśniowski Piotr
> On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:
>
> You might be running into https://github.com/apache/beam/issues/30867.
>
> Among the error messages you mentioned, the following is closest to the
> root cause: ```Error message from worker: generic::internal: Error
> encountered with the status channel: There are 10 consecutive failures
> obtaining SDK worker status info from sdk-0-0. The last success response
> was received 3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00.
> SDK worker appears to be permanently unresponsive. Aborting the SDK. For
> more information, see:
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
> 
>
> If mitigations in https://github.com/apache/beam/issues/30867 don't
> resolve your issue, please see
> https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
> for instructions on how to find what is causing the workers to be stuck.
>
> Thanks!
>
> On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr <
> contact.wisniowskipi...@gmail.com> wrote:
>
>> Hi,
>> We are investigating an issue with our Python SDK streaming pipelines,
>> and have a few questions, but first some context.
>> Our stack:
>> - Python SDK 2.54.0, but we also tried 2.55.1
>> - Dataflow Streaming Engine with the SDK in a container image (we also
>> tried Prime)
>> - Currently our pipelines have low enough traffic that a single node
>> handles them most of the time, but occasionally we do scale up.
>> - Deployment is done with the Terraform `google_dataflow_flex_template_job`
>> resource, which normally performs a job update when re-applying Terraform.
>> - We use `ReadModifyWriteStateSpec` heavily, along with other states and
>> watermark timers, but we keep the size of the state under control.
>> - We use custom coders based on Pydantic Avro.
>> The issue:
>> - Occasionally watermark progression stops. The issue is not
>> deterministic and happens about 1-2 times per day across a few pipelines.
>> - No user code errors are reported, but we do get errors like these:
>> ```INTERNAL: The work item requesting state read is no longer valid on
>> the backend. The work has already completed or will be retried. This is
>> expected during autoscaling events.
>> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217
>> } } } [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
>> [dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT
>> }']```
>> ```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that the
>> process running the pipeline code has crashed. Inspect the Worker Logs and
>> the Diagnostics tab to determine the cause of the crash.
>> [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]```
>> ```Work item for sharding key 8dd4578b4f280f5d tokens
>> (1316764909133315359, 17766288489530478880) encountered error during
>> processing, will be retried (possibly on another worker):
>> generic::internal: Error encountered with the status channel: SDK harness
>> sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK
>> disconnect.```
>> ```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: segmentation
>> fault (core dumped) restarting SDK process```
>> - We did manage to correlate this with either a vertical autoscaling event
>> (when using Prime) or other worker replacements done by Dataflow under the
>> hood, but this is not deterministic.
>> - For a few hours watermark progress stops, but other workers do
>> process messages.
>> - And after a few hours:
>> ```Error message from worker: generic::internal: Error encountered with
>

Re: Watermark progress halt in Python streaming pipelines

2024-04-24 Thread Wiśniowski Piotr

Hi!

Thank you for the hint. We will try the mitigation from the issue.
We already tried everything from
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
, but let's hope upgrading the dependency will help. Will reply to
this thread once I get confirmation.
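
As part of that confirmation, one simple sanity check (illustrative; the
class name is made up and nothing here is prescribed by the docs) is to log
the grpcio version that actually ends up inside the running worker
container, e.g. from a DoFn's setup:

```python
import logging
from importlib.metadata import version

import apache_beam as beam


class LogGrpcioVersionDoFn(beam.DoFn):
    """Illustrative check: confirm which grpcio build the SDK harness runs."""

    def setup(self):
        # Logged once per harness process; appears in the worker logs.
        logging.info("grpcio version in worker: %s", version("grpcio"))

    def process(self, element):
        yield element
```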


BTW, great job on the investigation of the bug that you mentioned.
Impressive. Seems like a nasty one.


Best,

Wiśniowski Piotr

On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:

You might be running into https://github.com/apache/beam/issues/30867.

Among the error messages you mentioned, the following is closest to the
root cause: ```Error message from worker: generic::internal: Error
encountered with the status channel: There are 10 consecutive failures
obtaining SDK worker status info from sdk-0-0. The last success
response was received 3h20m2.648304212s ago at
2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be
permanently unresponsive. Aborting the SDK. For more information, see:
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```



If mitigations in https://github.com/apache/beam/issues/30867 don't 
resolve your issue, please see 
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact 
for instructions on how to find what is causing the workers to be stuck.


Thanks!

On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr 
 wrote:


Hi,
We are investigating an issue with our Python SDK streaming
pipelines, and have a few questions, but first some context.
Our stack:
- Python SDK 2.54.0, but we also tried 2.55.1
- Dataflow Streaming Engine with the SDK in a container image (we
also tried Prime)
- Currently our pipelines have low enough traffic that a single
node handles them most of the time, but occasionally we do
scale up.
- Deployment is done with the Terraform `google_dataflow_flex_template_job`
resource, which normally performs a job update when re-applying Terraform.
- We use `ReadModifyWriteStateSpec` heavily, along with other states and
watermark timers, but we keep the size of the state under control.
- We use custom coders based on Pydantic Avro. (A minimal sketch of the
kind of stateful DoFn we use follows this list.)
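
To make the state and timer usage concrete, here is a minimal sketch of the
pattern; the names, the `VarIntCoder`, and the one-hour flush delay are
illustrative, and our real DoFns use the Pydantic Avro coders mentioned
above and more state cells:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec,
    TimerSpec,
    on_timer,
)


class TrackLastValueDoFn(beam.DoFn):
    """Illustrative stateful DoFn: keeps the last value per key and flushes
    it on an event-time (watermark) timer."""

    LAST_VALUE = ReadModifyWriteStateSpec('last_value', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(
        self,
        element,
        timestamp=beam.DoFn.TimestampParam,
        last_value=beam.DoFn.StateParam(LAST_VALUE),
        flush=beam.DoFn.TimerParam(FLUSH),
    ):
        key, value = element
        # Remember the latest value for this key.
        last_value.write(value)
        # Fire when the watermark passes one hour after this element.
        flush.set(timestamp + 3600)
        yield element

    @on_timer(FLUSH)
    def on_flush(
        self,
        key=beam.DoFn.KeyParam,
        last_value=beam.DoFn.StateParam(LAST_VALUE),
    ):
        # Emit and clear the per-key state once the watermark fires.
        yield (key, last_value.read())
        last_value.clear()
```

The DoFn is applied with `beam.ParDo` to a keyed PCollection of
`(key, value)` tuples, which is what enables per-key state and
watermark timers.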
The issue:
- Occasionally watermark progression stops. The issue is not
deterministic and happens about 1-2 times per day across a few pipelines.
- No user code errors are reported, but we do get errors like these:
```INTERNAL: The work item requesting state read is no longer
valid on the backend. The work has already completed or will be
retried. This is expected during autoscaling events.
[type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
{ trail_point { source_file_loc { filepath:
"dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217
} } } [dist_proc.dax.MessageCode] { origin_id: 5391582787251181999
[dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT
}']```
```Work item for sharding key 8dd4578b4f280f5d tokens
(1316764909133315359, 17766288489530478880) encountered error
during processing, will be retried (possibly on another worker):
generic::internal: Error encountered with the status channel: SDK
harness sdk-0-0 disconnected. with MessageCode:
(93f1db2f7a4a325c): SDK disconnect.```
```Python (worker sdk-0-0_sibling_1) exited 1 times: signal:
segmentation fault (core dumped) restarting SDK process```
- We did manage to correlate this with either a vertical autoscaling
event (when using Prime) or other worker replacements done by
Dataflow under the hood, but this is not deterministic.
- For a few hours watermark progress stops, but other workers do
process messages.
- And after a few hours:
```Error message from worker: generic::internal: Error encountered
with the status channel: There are 10 consecutive failures
obtaining SDK worker status info from sdk-0-0. The last success
response was received 3h20m2.648304212s ago at
2024-04-23T11:48:35.493682768+00:00. SDK worker appears to be
permanently unresponsive. Aborting the SDK. For more information,
see:

https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```
- And the pipeline starts to catch up and w