Hi,

Thank you for your response.

I am not using apache-beam directly but the tensorflow-data-validation
API, so I'm not sure whether there is a deadlock or not.
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord
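
For context, the whole pipeline is driven by that single API call; a
minimal sketch of the invocation, roughly along these lines (paths are
placeholders, stats options left at their defaults):

```
import tensorflow_data_validation as tfdv

# Placeholder paths; the real job reads the ~400GB TFRecord dataset.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/data/train*.tfrecord',
    output_path='/data/train_stats.tfrecord')
```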

What I can tell is that I didn't see any stack trace related to a
deadlock; I've attached the stack (from pystack) of one worker.
The program also ran fine on a smaller but still sizable dataset of
around 200GB, which took over 10 minutes to finish.

Thanks,
Junjian



On Tue, Aug 25, 2020 at 12:56 AM Luke Cwik <lc...@google.com> wrote:

> Another person reported something similar for Dataflow and it seemed as
> though in their scenario they were using locks and either got into a
> deadlock or starved processing for long enough that the watchdog also
> failed. Are you using locks and/or having really long single-element
> processing times?
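>
> (For illustration only, the pattern in question would look something
> like the hypothetical DoFn below: a shared lock plus slow per-element
> work can stall a bundle past the keepalive window.)
>
> ```
> import threading
> import time
>
> import apache_beam as beam
>
> _lock = threading.Lock()
>
> class SlowLockedDoFn(beam.DoFn):
>     # Hypothetical: every bundle serializes on one process-wide lock,
>     # so a single slow element can starve the other worker threads
>     # long enough for the gRPC keepalive watchdog to fire.
>     def process(self, element):
>         with _lock:
>             time.sleep(600)  # long single-element processing time
>         yield element
> ```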
>
> On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu <j...@indeed.com> wrote:
>
>> Hi,
>>
>> I’m running into a problem with tensorflow-data-validation on the direct
>> runner while generating statistics from some large datasets over 400GB.
>>
>> All workers stopped working after an error message of
>> “Keepalive watchdog fired. Closing transport.”, which appears to be a
>> gRPC keepalive timeout.
>>
>> ```
>> E0804 17:49:07.419950276   44806 chttp2_transport.cc:2881]
>> ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
>> 2020-08-04 17:49:07  local_job_service.py : INFO  Worker: severity: ERROR
>> timestamp {   seconds: 1596563347   nanos: 420487403 } message: "Python sdk
>> harness failed: \nTraceback (most recent call last):\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
>> line 158, in main\n
>>  sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
>> line 213, in run\n    for work_request in
>> self._control_stub.Control(get_responses()):\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
>> 416, in __next__\n    return self._next()\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
>> 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
>> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
>> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
>> timeout\"\n\tdebug_error_string =
>> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
>> from peer
>> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
>> watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent
>> call last):\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
>> line 158, in main\n
>>  sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
>> line 213, in run\n    for work_request in
>> self._control_stub.Control(get_responses()):\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
>> 416, in __next__\n    return self._next()\n  File
>> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
>> 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
>> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
>> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
>> timeout\"\n\tdebug_error_string =
>> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
>> from peer
>> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
>> watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location:
>> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
>> thread: "MainThread"
>> Traceback (most recent call last):
>>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
>>     "__main__", mod_spec)
>>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
>>     exec(code, run_globals)
>>   File
>> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
>> line 248, in <module>
>>     main(sys.argv)
>>   File
>> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
>> line 158, in main
>>     sdk_pipeline_options.view_as(ProfilingOptions))).run()
>>   File
>> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 213, in run
>>     for work_request in self._control_stub.Control(get_responses()):
>>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
>> line 416, in __next__
>>     return self._next()
>>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
>> line 706, in _next
>>     raise self
>> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC
>> that terminated with:
>>         status = StatusCode.UNAVAILABLE
>>         details = "keepalive watchdog timeout"
>>         debug_error_string =
>> "{"created":"@1596563347.420024732","description":"Error received from peer
>> ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive
>> watchdog timeout","grpc_status":14}"
>> ```
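>>
>> For what it's worth, the watchdog appears to be governed by ordinary
>> gRPC keepalive channel arguments rather than anything Beam-specific. A
>> sketch of how such a channel would be tuned (the target and values are
>> illustrative, not the actual harness wiring):
>>
>> ```
>> import grpc
>>
>> # Illustrative only: widen the keepalive window on a channel.
>> channel = grpc.insecure_channel(
>>     'localhost:40823',
>>     options=[
>>         ('grpc.keepalive_time_ms', 60000),      # ping every 60s
>>         ('grpc.keepalive_timeout_ms', 120000),  # allow 120s for the ack
>>         ('grpc.keepalive_permit_without_calls', 1),
>>     ])
>> ```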
>>
>> I originally raised this issue with the tensorflow-data-validation
>> community, but we couldn't come up with a solution.
>> https://github.com/tensorflow/data-validation/issues/133
>>
>> The Beam version is 2.22.0. Please let me know if I missed anything.
>>
>> Thanks,
>> Junjian
>>
>

Attachment: worker_stack
Description: Binary data

Attachment: master_exit_error
Description: Binary data
