Hello,
for Graphite, Flink uses the DropWizard metrics reporter. I don't know
at the moment whether it supports any kind of reconnecting functionality.
I'm not sure whether I understood you correctly; did you try upgrading
the DropWizard metrics-core/metrics-graphite dependencies?
If that didn't do the trick, we could in fact implement this in Flink,
though it would be a hack. When an error occurs we can simply
re-instantiate the reporter, but we would have to know how the reporter
communicates the connection drop, i.e. whether it throws an exception or not.
Could you check the log for any warning statements from the MetricRegistry?
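To make the re-instantiation idea concrete, here is a rough sketch, not Flink code. It assumes (this is exactly the open question above) that a dropped connection surfaces as an exception from the report call. `SimpleReporter` and `ReconnectingReporter` are hypothetical names for illustration only; they are not Flink's `MetricReporter` or DropWizard's `GraphiteReporter` API. The wrapper builds reporters from a factory and, when a report fails, closes the broken instance and creates a fresh one for the next interval:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ReconnectingReporterSketch {

    /** Hypothetical minimal reporter interface (not Flink's or DropWizard's API). */
    interface SimpleReporter {
        void report() throws Exception;
        void close();
    }

    /** Wraps a reporter and re-creates it from a factory after a failed report(). */
    static final class ReconnectingReporter implements SimpleReporter {
        private final Supplier<SimpleReporter> factory;
        private SimpleReporter delegate;
        private int recreations = 0;

        ReconnectingReporter(Supplier<SimpleReporter> factory) {
            this.factory = factory;
            this.delegate = factory.get();
        }

        @Override
        public void report() {
            try {
                delegate.report();
            } catch (Exception e) {
                // Assumption: the connection drop is reported as an exception.
                // Discard the broken instance and build a fresh one, which would
                // open a new connection on the next reporting interval.
                System.err.println("Report failed, re-instantiating reporter: " + e);
                delegate.close();
                delegate = factory.get();
                recreations++;
            }
        }

        @Override
        public void close() {
            delegate.close();
        }

        int recreations() {
            return recreations;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fake reporter whose second report() fails, simulating a dropped TCP connection.
        ReconnectingReporter reporter = new ReconnectingReporter(() -> new SimpleReporter() {
            @Override
            public void report() throws Exception {
                if (calls.incrementAndGet() == 2) {
                    throw new IOException("connection reset");
                }
            }

            @Override
            public void close() {
            }
        });
        for (int i = 0; i < 3; i++) {
            reporter.report();
        }
        reporter.close();
        System.out.println("recreations=" + reporter.recreations());
    }
}
```

If the reporter instead fails silently (no exception), a wrapper like this would never notice, which is why knowing how the drop is communicated matters.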
Regards,
Chesnay
On 05.05.2017 13:26, Bruno Aranda wrote:
Hi,
We are using the Graphite reporter from Flink 1.2.0 to send the
metrics via TCP. Due to our network configuration we cannot use UDP at
the moment.
We have observed that if there is any problem with Graphite or the
network (basically, the TCP connection times out or something similar),
the metrics reporter does not recover. This is easy to reproduce by
using iptables to block the port we send the metrics to. If we block
the port for more than a minute or so, the problem occurs: after the
port is reopened, Flink does not resume reporting as before.
Is this a known issue? Googling shows some problems with the
metrics-graphite package that should have been solved already. We have
tried updating metrics-core/metrics-graphite to the latest version, with
no success.
Any ideas?
Thanks!
Bruno