Re: Watermark progress halt in Python streaming pipelines

2024-04-30 Thread Wiśniowski Piotr

Hi,

Getting back with an update: after upgrading `grpcio` the issues are gone.
Thank you for the solution and the investigation. Feels like I owe you a
beer :)
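
For anyone who lands on this thread later: the fix on our side was just a
dependency bump in the Flex Template image. A rough sketch of the kind of
pin we mean (hypothetical project name; the grpcio bound is a placeholder,
take the real constraint from https://github.com/apache/beam/issues/30867):

```python
# Hypothetical setup.py for the Flex Template / custom SDK container image.
# The grpcio bound is a placeholder -- use whatever version the mitigation
# notes in https://github.com/apache/beam/issues/30867 recommend.
import setuptools

setuptools.setup(
    name="streaming-pipeline",  # illustrative name, not our real project
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        "apache-beam[gcp]==2.55.1",
        "grpcio>=1.62.2",  # placeholder bound only
    ],
)
```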


Best

Wiśniowski Piotr

On 24.04.2024 22:11, Valentyn Tymofieiev wrote:



On Wed, Apr 24, 2024 at 12:40 PM Wiśniowski Piotr wrote:


Hi!

Thank you for the hint. We will try the mitigation from the
issue. We had already tried everything from
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
, but let's hope upgrading the dependency will help. I will
reply to this thread once I get confirmation.

BTW, great job on the investigation of the bug you mentioned.
Impressive. Seems like a nasty one.

Thanks. I was specifically recommending that you check the recently added 
content under "It might be possible to retrieve stacktraces of a 
thread that is holding the GIL on a running Dataflow worker as 
follows:", as that should help find out what is causing the workers to 
get stuck in your case. But hopefully it won't be necessary after you 
adjust the grpcio version.
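
To give a flavor of what that section boils down to: a tool like py-spy can
dump the stack of each thread in a running SDK harness process from outside
the interpreter, so it works even when the process itself is unresponsive.
A rough, hypothetical helper along those lines (run on the worker VM inside
the SDK harness container; the guide remains the authoritative set of steps):

```python
# Hypothetical helper: dump the stacks of all Python processes on the worker
# with py-spy (pip install py-spy). Intended to be run inside the SDK harness
# container on the Dataflow worker VM.
import subprocess


def dump_python_stacks():
    # Find candidate Python processes; on a stuck worker this typically
    # includes the SDK harness process we care about.
    pids = subprocess.run(
        ["pgrep", "-f", "python"], capture_output=True, text=True, check=False
    ).stdout.split()
    for pid in pids:
        print(f"=== py-spy dump for pid {pid} ===")
        # --nonblocking avoids pausing the target process while sampling.
        subprocess.run(["py-spy", "dump", "--pid", pid, "--nonblocking"], check=False)


if __name__ == "__main__":
    dump_python_stacks()
```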


Best,

Wiśniowski Piotr

On 24.04.2024 00:31, Valentyn Tymofieiev via user wrote:

You might be running into
https://github.com/apache/beam/issues/30867.

Among the error messages you mentioned, the following is closest
to the root cause: ```Error message from worker: generic::internal:
Error encountered with the status channel: There are 10
consecutive failures obtaining SDK worker status info from
sdk-0-0. The last success response was received 3h20m2.648304212s
ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker appears to
be permanently unresponsive. Aborting the SDK. For more
information, see:
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```



If the mitigations in https://github.com/apache/beam/issues/30867
don't resolve your issue, please see
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact
for instructions on how to find what causes the workers to be stuck.

Thanks!

On Tue, Apr 23, 2024 at 12:17 PM Wiśniowski Piotr wrote:

Hi,
We are investigating an issue with our Python SDK streaming
pipelines, and have a few questions, but first some context.
Our stack:
- Python SDK 2.54.0 (we also tried 2.55.1).
- Dataflow Streaming Engine with the SDK in a custom container
image (we also tried Prime).
- Currently our pipelines have low enough traffic that a single
node handles them most of the time, but occasionally we do
scale up.
- Deployment is done with the Terraform
`google_dataflow_flex_template_job` resource, which normally
performs a job update when Terraform is re-applied.
- We make heavy use of `ReadModifyWriteStateSpec`, other states,
and watermark timers, but we keep the size of the state
under control.
- We use custom coders (Pydantic Avro).
The issue:
- Occasionally watermark progression stops. The issue is not
deterministic and happens roughly 1-2 times per day across a
few pipelines.
- No user-code errors are reported, but we do get errors like these:
```INTERNAL: The work item requesting state read is no longer
valid on the backend. The work has already completed or will
be retried. This is expected during autoscaling events.
[type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
{ trail_point { source_file_loc { filepath:
"dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
```ABORTED: SDK harness sdk-0-0 disconnected. This usually means
that the process running the pipeline code has crashed. Inspect
the Worker Logs and the Diagnostics tab to determine the cause
of the crash.
[type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
{ trail_point { source_file_loc { filepath:
"dist_proc/dax/workflow/worker/fnapi_control_service.cc"
line: 217 } } } [dist_proc.dax.MessageCode] { origin_id:
5391582787251181999
[dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
```Work item for sharding key 8dd4578b4f280f5d tokens
(1316764909133315359, 17766288489530478880) encountered error
during processing, will be retried (possibly on another
worker): generic::internal: Error encountered with the status
channel: SDK harness sdk-0-0 disconnected. with MessageCode:
(93f1db2f7a4a325c): SDK disconnect.```
```Python (worker sdk-0-0_sibling_1) exited 1 times: signal:
segmentation fault (core dumped) restarting SDK process```

Watermark progress halt in Python streaming pipelines

2024-04-23 Thread Wiśniowski Piotr

Hi,
We are investigating an issue with our Python SDK streaming pipelines, 
and have a few questions, but first some context.

Our stack:
- Python SDK 2.54.0 (we also tried 2.55.1).
- Dataflow Streaming Engine with the SDK in a custom container image (we 
also tried Prime).
- Currently our pipelines have low enough traffic that a single node 
handles them most of the time, but occasionally we do scale up.
- Deployment is done with the Terraform `google_dataflow_flex_template_job` 
resource, which normally performs a job update when Terraform is re-applied.
- We make heavy use of `ReadModifyWriteStateSpec`, other states, and 
watermark timers (a rough sketch of the pattern follows this list), but 
we keep the size of the state under control.
- We use custom coders (Pydantic Avro).
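
To illustrate the state/timer pattern mentioned above, here is a minimal,
made-up sketch (not our production code; the key type, coder, and the
60-second delay are arbitrary):

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec,
    TimerSpec,
    on_timer,
)


class TrackLastValueFn(beam.DoFn):
    """Keeps the last value per key and flushes it on a watermark timer."""

    LAST_VALUE = ReadModifyWriteStateSpec('last_value', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(
        self,
        element,  # (key, value) pairs from a keyed PCollection
        timestamp=beam.DoFn.TimestampParam,
        last_value=beam.DoFn.StateParam(LAST_VALUE),
        flush=beam.DoFn.TimerParam(FLUSH),
    ):
        _, value = element
        last_value.write(value)
        # Fire once the watermark passes 60 seconds after this element.
        flush.set(timestamp + 60)
        yield element

    @on_timer(FLUSH)
    def on_flush(self, last_value=beam.DoFn.StateParam(LAST_VALUE)):
        value = last_value.read()
        last_value.clear()
        if value is not None:
            yield 'flushed', value


# Usage: ... | beam.Map(lambda x: (x['id'], x['count'])) | beam.ParDo(TrackLastValueFn())
```
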
The issue:
- Occasionally watermark progression stops. The issue is not 
deterministic and happens roughly 1-2 times per day across a few pipelines.
- No user-code errors are reported, but we do get errors like these:
```INTERNAL: The work item requesting state read is no longer valid on 
the backend. The work has already completed or will be retried. This is 
expected during autoscaling events. 
[type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] 
{ trail_point { source_file_loc { filepath: 
"dist_proc/windmill/client/streaming_rpc_client.cc" line: 767 } } }']```
```ABORTED: SDK harness sdk-0-0 disconnected. This usually means that 
the process running the pipeline code has crashed. Inspect the Worker 
Logs and the Diagnostics tab to determine the cause of the crash. 
[type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] 
{ trail_point { source_file_loc { filepath: 
"dist_proc/dax/workflow/worker/fnapi_control_service.cc" line: 217 } } } 
[dist_proc.dax.MessageCode] { origin_id: 5391582787251181999 
[dist_proc.dax.workflow.workflow_io_message_ext]: SDK_DISCONNECT }']```
```Work item for sharding key 8dd4578b4f280f5d tokens 
(1316764909133315359, 17766288489530478880) encountered error during 
processing, will be retried (possibly on another worker): 
generic::internal: Error encountered with the status channel: SDK 
harness sdk-0-0 disconnected. with MessageCode: (93f1db2f7a4a325c): SDK 
disconnect.```
```Python (worker sdk-0-0_sibling_1) exited 1 times: signal: 
segmentation fault (core dumped) restarting SDK process```
- We did manage to correlate this with either a vertical autoscaling event 
(when using Prime) or other worker replacements done by Dataflow under 
the hood, but the correlation is not deterministic.
- For a few hours the watermark does not progress, although other workers 
do keep processing messages.
- And after a few hours:
```Error message from worker: generic::internal: Error encountered with 
the status channel: There are 10 consecutive failures obtaining SDK 
worker status info from sdk-0-0. The last success response was received 
3h20m2.648304212s ago at 2024-04-23T11:48:35.493682768+00:00. SDK worker 
appears to be permanently unresponsive. Aborting the SDK. For more 
information, see: 
https://cloud.google.com/dataflow/docs/guides/common-errors#worker-lost-contact```

- And then the pipeline starts to catch up and the watermark progresses again.
- A job update via Terraform apply also fixes the issue.
- We do not see excessive use of worker memory or disk. CPU utilization 
is also close to idle most of the time. I do not think we use any C/C++ 
code with Python, nor any parallelism/threads outside of Beam's own 
parallelization.

Questions:
1. What could be the potential causes of such behavior? How can we get 
more insight into this problem?
2. I have seen `In Python pipelines, when shutting down inactive bundle 
processors, shutdown logic can overaggressively hold the lock, blocking 
acceptance of new work` listed as a known issue in the Beam release notes. 
What is the status of this? Could it potentially be related?

We would really appreciate any help, clues, or hints on how to debug this issue.
Best regards,
Wiśniowski Piotr