Hi,
Thank you for your response.
I am not using apache-beam directly but using tensorflow-data-validation
API, so I'm sure about if there is any deadlock or not.
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord
But what I can tell is that I
Another person reported something similar for Dataflow and it seemed as
though in their scenario they were using locks and either got into a
deadlock or starved processing for long enough that the watchdog also
failed. Are you using locks and/or having really long single element
processing times?
Hi,
I’m running into a problem of tensorflow-data-validation with direct runner
to generate statistics from some large datasets over 400GB.
It seems that all workers stopped working after an error message of
“Keepalive watchdog fired. Closing transport.” It seems to be a grpc
keepalive timeout.