Re: tensorflow-data-validation(with direct runner) failed to process large data because of grpc timeout on workers

2020-08-25 Thread Junjian Xu
Hi, Thank you for your response. I am not using apache-beam directly; I am using the tensorflow-data-validation API, so I'm not sure whether there is any deadlock. https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_tfrecord But what I can tell is that I

Re: tensorflow-data-validation(with direct runner) failed to process large data because of grpc timeout on workers

2020-08-24 Thread Luke Cwik
Another person reported something similar for Dataflow. In their scenario they were using locks and either got into a deadlock or starved processing for long enough that the watchdog also failed. Are you using locks, or do you have really long single-element processing times?

tensorflow-data-validation(with direct runner) failed to process large data because of grpc timeout on workers

2020-08-24 Thread Junjian Xu
Hi, I’m running into a problem using tensorflow-data-validation with the direct runner to generate statistics from some large datasets (over 400GB). All workers seem to stop working after the error message “Keepalive watchdog fired. Closing transport.” It appears to be a gRPC keepalive timeout.