Hi,

Thank you for your response.
I am not using apache-beam directly but through the tensorflow-data-validation API, so I'm not sure whether there is any deadlock or not.
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord

What I can tell is that I didn't see any stack trace related to a deadlock. I've attached the stack (from pystack) of one worker. The program also runs well on a relatively small but still sizable dataset of around 200GB, which took over 10 minutes to finish.

Thanks,
Junjian

On Tue, Aug 25, 2020 at 12:56 AM Luke Cwik <lc...@google.com> wrote:

> Another person reported something similar for Dataflow and it seemed as
> though in their scenario they were using locks and either got into a
> deadlock or starved processing for long enough that the watchdog also
> failed. Are you using locks and/or having really long single-element
> processing times?
>
> On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu <j...@indeed.com> wrote:
>
>> Hi,
>>
>> I'm running into a problem with tensorflow-data-validation on the direct
>> runner when generating statistics from some large datasets over 400GB.
>>
>> It seems that all workers stopped working after an error message of
>> "Keepalive watchdog fired. Closing transport." This looks like a gRPC
>> keepalive timeout.
>>
>> ```
>> E0804 17:49:07.419950276 44806 chttp2_transport.cc:2881] ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
>> 2020-08-04 17:49:07 local_job_service.py : INFO Worker: severity: ERROR timestamp { seconds: 1596563347 nanos: 420487403 } message: "Python sdk harness failed: \nTraceback (most recent call last):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n    sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent call last):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n    sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n    for work_request in self._control_stub.Control(get_responses()):\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n    return self._next()\n  File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location: "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161" thread: "MainThread"
>> Traceback (most recent call last):
>>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
>>     "__main__", mod_spec)
>>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
>>     exec(code, run_globals)
>>   File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 248, in <module>
>>     main(sys.argv)
>>   File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 158, in main
>>     sdk_pipeline_options.view_as(ProfilingOptions))).run()
>>   File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 213, in run
>>     for work_request in self._control_stub.Control(get_responses()):
>>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
>>     return self._next()
>>   File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 706, in _next
>>     raise self
>> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
>> 	status = StatusCode.UNAVAILABLE
>> 	details = "keepalive watchdog timeout"
>> 	debug_error_string = "{"created":"@1596563347.420024732","description":"Error received from peer ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
>> ```
>>
>> I originally raised the issue in the tensorflow-data-validation community
>> but we couldn't come up with any solution.
>> https://github.com/tensorflow/data-validation/issues/133
>>
>> The beam version is 2.22.0. Please let me know if I missed anything.
>>
>> Thanks,
>> Junjian
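[Editor's note: for readers hitting the same "keepalive watchdog timeout", the gRPC behavior in the trace is governed by standard gRPC channel arguments such as `grpc.keepalive_time_ms` and `grpc.keepalive_timeout_ms`. A minimal Python sketch of what those arguments look like is below; the values are illustrative assumptions, and whether Beam 2.22.0's SDK harness exposes them for user tuning is not established in this thread.]

```python
# Illustrative gRPC channel arguments that govern keepalive behavior.
# The key names are standard gRPC core channel args; the chosen values
# are example assumptions, not recommendations from this thread.
keepalive_options = [
    # Send a keepalive ping after 5 minutes of transport inactivity.
    ("grpc.keepalive_time_ms", 5 * 60 * 1000),
    # Wait up to 60s for the ping ack; past this, the keepalive watchdog
    # fires and closes the transport (the error seen in the logs above).
    ("grpc.keepalive_timeout_ms", 60 * 1000),
    # 0 removes the limit on pings sent without data on the connection.
    ("grpc.http2.max_pings_without_data", 0),
]

def build_channel(target):
    """Create a channel with the keepalive options applied.

    Requires grpcio; the import is deferred so the options list above is
    usable (e.g. inspectable) without grpc installed.
    """
    import grpc
    return grpc.insecure_channel(target, options=keepalive_options)
```

The point of the sketch is only to show where such settings live (channel creation), not to claim they are user-configurable in the tfdv/Beam stack discussed here.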
[Attachment: worker_stack (binary data)]
[Attachment: master_exit_error (binary data)]
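[Editor's note: on Luke's question of distinguishing a deadlock from a long-running single element, external tools like pystack were used above; the Python stdlib can also take an in-process snapshot of every thread's stack via `sys._current_frames()`. A minimal sketch (the function name is ours, purely illustrative): a long-running element shows the same frame across repeated snapshots, while a deadlock shows threads parked in lock acquisition.]

```python
import sys
import threading
import traceback

def dump_all_thread_stacks():
    """Return a snapshot of every thread's current Python stack as a string."""
    lines = []
    # Map thread id -> topmost frame for all running threads.
    frames = sys._current_frames()
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in frames.items():
        lines.append("# Thread %s (%d)" % (names.get(ident, "?"), ident))
        lines.extend(l.rstrip() for l in traceback.format_stack(frame))
    return "\n".join(lines)

if __name__ == "__main__":
    # Calling this periodically (e.g. from a watchdog thread or a signal
    # handler) lets you compare snapshots over time.
    print(dump_all_thread_stacks())
```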