I guess there is a bigger issue here. We dropped the property to 500. We
also realized that this failure happened on a TM that had one specific job
running on it. What was good (but surprising) was that the exception became
the more protocol-specific 413 (as in, the chunk is greater than some size
limit DD has on a request):

Failed to send request to Datadog (response was Response{protocol=h2,
code=413, message=, url=
https://app.datadoghq.com/api/v1/series?api_key=**********}
)

which implies that the socket timeout was masking this issue. At 2000
metrics per request the payload was just huge, and DD was unable to parse
it in time (or was slow to ingest it, etc.). Now we could go lower, but
that makes less sense. We could instead play with
https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope
to reduce the size of the tags (or keys), as sketched below.
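
For example (a sketch based on the default scope formats in the linked
docs; the exact keys and defaults should be checked against our Flink
version), dropping variables such as <host> and <tm_id> shortens every
metric name, at the cost of some disambiguation:

# default: metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.task: <job_name>.<task_name>.<subtask_index>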

On Tue, Mar 23, 2021 at 11:33 AM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> If we look at this
> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpReporter.java#L159>
> code, the metrics are divided into chunks up to a max size and enqueued
> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L110>.
> The Request
> <https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-metrics/flink-metrics-datadog/src/main/java/org/apache/flink/metrics/datadog/DatadogHttpClient.java#L75>
> has a 3-second read/connect/write timeout, which IMHO should have been
> configurable (or is it?). Since the number of metrics (all metrics)
> exposed by a Flink cluster is pretty high (and the metric names carry
> tags as well), it may make sense to limit the number of metrics in a
> single chunk (to ultimately limit the size of a single chunk). There is
> this configuration which allows for reducing the metrics in a single chunk:
>
> metrics.reporter.dghttp.maxMetricsPerRequest: 2000
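>
> For illustration, the mechanism amounts to roughly the following (a
> paraphrased sketch of the linked code under my reading of it, not the
> verbatim Flink source):
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.TimeUnit;
> import okhttp3.OkHttpClient;
>
> class ReporterSketch {
>     // The linked DatadogHttpClient builds its OkHttpClient with fixed
>     // 3-second connect/read/write timeouts; they are not configurable.
>     static final OkHttpClient CLIENT =
>             new OkHttpClient.Builder()
>                     .connectTimeout(3, TimeUnit.SECONDS)
>                     .readTimeout(3, TimeUnit.SECONDS)
>                     .writeTimeout(3, TimeUnit.SECONDS)
>                     .build();
>
>     // Metrics are split into chunks of at most maxMetricsPerRequest
>     // entries; each chunk is serialized and sent as one HTTP request.
>     static <T> List<List<T>> chunk(List<T> metrics, int maxMetricsPerRequest) {
>         List<List<T>> chunks = new ArrayList<>();
>         for (int i = 0; i < metrics.size(); i += maxMetricsPerRequest) {
>             chunks.add(new ArrayList<>(
>                     metrics.subList(i, Math.min(i + maxMetricsPerRequest, metrics.size()))));
>         }
>         return chunks;
>     }
> }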
>
> We could decrease this to 1500 (1500 is pretty arbitrary, not based on
> any empirical reasoning) and see if that stabilizes the dispatch. It is
> inevitable that the number of requests will grow and we may hit the
> throttle, but then we would see a known exception rather than timeouts,
> which are generally less intuitive.
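>
> (For a rough sense of the arithmetic, with hypothetical numbers: at
> 30,000 metrics per report, 2000 per chunk means 15 requests per reporting
> cycle, while 1500 per chunk means 20 requests, i.e. about 33% more
> requests, each carrying a roughly 25% smaller payload.)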
>
> Any thoughts?
>
>
>
> On Mon, Mar 22, 2021 at 10:37 AM Arvid Heise <ar...@apache.org> wrote:
>
>> Hi Vishal,
>>
>> I have no experience in the Flink+DataDog setup but worked a bit with
>> DataDog before.
>> I'd agree that the timeout does not look like a rate limit. It would also
>> be odd for the other TMs with a similar rate to still pass. So I'd suspect
>> network issues.
>> Can you log into the TM's machine and try out manually how the system
>> behaves?
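>>
>> A minimal probe from the TM host could look like the sketch below
>> (assumptions on my part: OkHttp on the classpath, the v1 series endpoint
>> from the log line above, and the API key in a DD_API_KEY env var; this
>> is only a connectivity probe, not the reporter's actual code):
>>
>> import java.util.concurrent.TimeUnit;
>> import okhttp3.MediaType;
>> import okhttp3.OkHttpClient;
>> import okhttp3.Request;
>> import okhttp3.RequestBody;
>> import okhttp3.Response;
>>
>> public class DatadogProbe {
>>     public static void main(String[] args) throws Exception {
>>         // Mirror the reporter's 3-second timeouts so the probe fails the same way.
>>         OkHttpClient client = new OkHttpClient.Builder()
>>                 .connectTimeout(3, TimeUnit.SECONDS)
>>                 .readTimeout(3, TimeUnit.SECONDS)
>>                 .writeTimeout(3, TimeUnit.SECONDS)
>>                 .build();
>>         // A single tiny gauge in the v1 series format: big enough to
>>         // exercise the path, small enough to rule out payload size.
>>         String body = "{\"series\":[{\"metric\":\"flink.probe\",\"points\":[["
>>                 + System.currentTimeMillis() / 1000 + ",1]],\"type\":\"gauge\"}]}";
>>         Request request = new Request.Builder()
>>                 .url("https://app.datadoghq.com/api/v1/series?api_key="
>>                         + System.getenv("DD_API_KEY"))
>>                 .post(RequestBody.create(MediaType.parse("application/json"), body))
>>                 .build();
>>         try (Response response = client.newCall(request).execute()) {
>>             System.out.println(response.code() + " " + response.message());
>>         }
>>     }
>> }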
>>
>> On Sat, Mar 20, 2021 at 1:44 PM Vishal Santoshi <
>> vishal.santo...@gmail.com> wrote:
>>
>>> Hello folks,
>>>                   This is quite strange. We see a TM stop reporting
>>> metrics to DataDog .The logs from that specific TM  for every DataDog
>>> dispatch time out with* java.net.SocketTimeoutException: timeout *and
>>> that seems to repeat over every dispatch to DataDog. It seems it is on a 10
>>> seconds cadence per container. The TM remains humming, so does not seem to
>>> be under memory/CPU distress. And the exception is *not* transient. It
>>> just stops dead and from there on timeout.
>>>
>>> Looking at the SLA provided by DataDog, a throttling event should
>>> pretty much never surface as a SocketTimeout, unless of course the
>>> reporting of that specific condition is off. This thus appears very much
>>> to be a network issue, which is weird, as other TMs on the same network
>>> just hum along, sending their metrics successfully. The other possibility
>>> is that the sheer volume of metrics from this TM is prohibitive. That
>>> said, the exception is still not helpful.
>>>
>>> Any ideas from folks who have used the DataDog reporter with Flink?
>>> Even pointers to best practices would be a sufficient beginning.
>>>
>>> Regards.
>>>
>>>
