Another person reported something similar on Dataflow. In their case they were using locks and either hit a deadlock or starved processing for long enough that the watchdog also failed. Are you using locks, or do you have very long processing times for single elements?
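If it helps answer the second question, here is a minimal sketch of one way to spot slow elements. It assumes you can wrap the per-element function yourself, which may not be the case for transforms built inside tensorflow-data-validation. The names `timed`, `threshold_s` and `slow_log` are hypothetical and not part of Beam or TFDV.

```python
import time
from typing import Any, Callable, List, Tuple


def timed(fn: Callable[[Any], Any],
          threshold_s: float,
          slow_log: List[Tuple[str, float]]) -> Callable[[Any], Any]:
    """Wrap fn so that calls taking at least threshold_s seconds are recorded.

    Hypothetical helper for illustration only.
    """
    def wrapper(element: Any) -> Any:
        start = time.monotonic()
        try:
            return fn(element)
        finally:
            elapsed = time.monotonic() - start
            if elapsed >= threshold_s:
                # Keep a truncated repr so huge elements do not flood the log.
                slow_log.append((repr(element)[:80], elapsed))
    return wrapper
```

The wrapped function could then be passed wherever the original per-element function was used, for example to `beam.Map`, and `slow_log` (or a logging call in its place) inspected for elements that take suspiciously long.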
On Mon, Aug 24, 2020 at 1:50 AM Junjian Xu <j...@indeed.com> wrote:

> Hi,
>
> I’m running into a problem of tensorflow-data-validation with direct
> runner to generate statistics from some large datasets over 400GB.
>
> It seems that all workers stopped working after an error message of
> “Keepalive watchdog fired. Closing transport.” It seems to be a grpc
> keepalive timeout.
>
> ```
> E0804 17:49:07.419950276 44806 chttp2_transport.cc:2881]
> ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
> 2020-08-04 17:49:07 local_job_service.py : INFO Worker: severity: ERROR
> timestamp { seconds: 1596563347 nanos: 420487403 } message: "Python sdk
> harness failed: \nTraceback (most recent call last):\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
> line 158, in main\n
> sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
> line 213, in run\n for work_request in
> self._control_stub.Control(get_responses()):\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 416, in __next__\n return self._next()\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous:
> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
> timeout\"\n\tdebug_error_string =
> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
> from peer
> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
> watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent
> call last):\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
> line 158, in main\n
> sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
> line 213, in run\n for work_request in
> self._control_stub.Control(get_responses()):\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 416, in __next__\n return self._next()\n File
> \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
> 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous:
> <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
> StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
> timeout\"\n\tdebug_error_string =
> \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
> from peer
> ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
> watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location:
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
> thread: "MainThread"
> Traceback (most recent call last):
> File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
> "__main__", mod_spec)
> File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
> exec(code, run_globals)
> File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 248, in <module>
> main(sys.argv)
> File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
> line 158, in main
> sdk_pipeline_options.view_as(ProfilingOptions))).run()
> File
> "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
> line 213, in run
> for work_request in self._control_stub.Control(get_responses()):
> File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
> line 416, in __next__
> return self._next()
> File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
> line 706, in _next
> raise self
> grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC
> that terminated with:
> status = StatusCode.UNAVAILABLE
> details = "keepalive watchdog timeout"
> debug_error_string =
> "{"created":"@1596563347.420024732","description":"Error received from peer
> ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive
> watchdog timeout","grpc_status":14}"
> ```
>
> I originally raised the issue in tensorflow-data-validation community but
> we couldn't come up with any solution.
> https://github.com/tensorflow/data-validation/issues/133
>
> The beam version is 2.22.0. Please let me know if I missed anything.
>
> Thanks,
> Junjian