Hi,

I’m running into a problem of tensorflow-data-validation with direct runner
to generate statistics from some large datasets over 400GB.

It seems that all workers stopped working after an error message of
“Keepalive watchdog fired. Closing transport.” It seems to be a grpc
keepalive timeout.

```
E0804 17:49:07.419950276   44806 chttp2_transport.cc:2881]
ipv6:[::1]:40823: Keepalive watchdog fired. Closing transport.
2020-08-04 17:49:07  local_job_service.py : INFO  Worker: severity: ERROR
timestamp {   seconds: 1596563347   nanos: 420487403 } message: "Python sdk
harness failed: \nTraceback (most recent call last):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
line 158, in main\n
 sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
line 213, in run\n    for work_request in
self._control_stub.Control(get_responses()):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
416, in __next__\n    return self._next()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
timeout\"\n\tdebug_error_string =
\"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
from peer
ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
watchdog timeout\",\"grpc_status\":14}\"\n>" trace: "Traceback (most recent
call last):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py\",
line 158, in main\n
 sdk_pipeline_options.view_as(ProfilingOptions))).run()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py\",
line 213, in run\n    for work_request in
self._control_stub.Control(get_responses()):\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
416, in __next__\n    return self._next()\n  File
\"/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py\", line
706, in _next\n    raise self\ngrpc._channel._MultiThreadedRendezvous:
<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus =
StatusCode.UNAVAILABLE\n\tdetails = \"keepalive watchdog
timeout\"\n\tdebug_error_string =
\"{\"created\":\"@1596563347.420024732\",\"description\":\"Error received
from peer
ipv6:[::1]:40823\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1055,\"grpc_message\":\"keepalive
watchdog timeout\",\"grpc_status\":14}\"\n>\n" log_location:
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py:161"
thread: "MainThread"
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globalse
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 248, in <module>
    main(sys.argv)
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker_main.py",
line 158, in main
    sdk_pipeline_options.view_as(ProfilingOptions))).run()
  File
"/home/ec2-user/lib64/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
line 213, in run
    for work_request in self._control_stub.Control(get_responses()):
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
line 416, in __next__
    return self._next()
  File "/home/ec2-user/lib64/python3.7/site-packages/grpc/_channel.py",
line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC
that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "keepalive watchdog timeout"
        debug_error_string =
"{"created":"@1596563347.420024732","description":"Error received from peer
ipv6:[::1]:40823","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"keepalive
watchdog timeout","grpc_status":14}"
```

I originally raised the issue in tensorflow-data-validation community but
we couldn't come up with any solution.
https://github.com/tensorflow/data-validation/issues/133

The beam version is 2.22.0. Please let me know if I missed anything.

Thanks,
Junjian

Reply via email to