Hi, I’m running into a problem using tensorflow-data-validation with the direct runner to generate statistics from some large datasets (over 400 GB).
It seems that all workers stopped working after the error message “Keepalive watchdog fired. Closing transport.” This appears to be a gRPC keepalive timeout.

```
E0804 17:49:07.419950276 44806 chttp2_transport.cc:2881] ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
2020-08-04 17:49:07 local_job_service.py : INFO Worker: severity: ERROR
timestamp {
  seconds: 1596563347
  nanos: 420487403
}
message: "Python sdk harness failed: \nTraceback (most recent call last):\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n return self._next()\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>"
trace: "Traceback (most recent call last):\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\", line 158, in main\n sdk_pipeline_options.view_as(ProfilingOptions))).run()\n File \"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\", line 213, in run\n for work_request in self._control_stub.Control(get_responses()):\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 416, in __next__\n return self._next()\n File \"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line 706, in _next\n raise self\ngrpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog timeout\"\n\tdebug_error_string = \"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received from peer ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive watchdog timeout\",\"grpc_status\":14}\"\n>\n"
log_location: "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
thread: "MainThread"

Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 248, in <module>
    main(sys.argv)
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 158, in main
    sdk_pipeline_options.view_as(ProfilingOptions))).run()
  File "/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 213, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "keepalive watchdog timeout"
    debug_error_string = "{"created":"@1596563347.420024732","description":"Error received from peer ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
```

I originally raised the issue in the tensorflow-data-validation community, but we couldn't come up with a solution:
https://github.com/tensorflow/data-validation/issues/133

The Beam version is 2.22.0. Please let me know if I missed anything.

Thanks,
Junjian