(setting this field is currently not possible from a Flink user
perspective; it is something I will investigate)
On 1/27/2021 10:30 AM, Chesnay Schepler wrote:
Yes, I can see how the memory issue could occur.
However, it should be limited to buffering 64 requests; that is the
default limit that okhttp imposes on concurrent calls.
Maybe lowering this value already does the trick.
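For reference, that cap lives on okhttp's Dispatcher (setMaxRequests, default 64). Below is a minimal sketch of what lowering it could look like, assuming the reporter allowed a custom client to be passed in (it currently does not, as noted above); the concrete values and class name are placeholders, not the reporter's actual code:
```
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;
import java.util.concurrent.TimeUnit;

public class DatadogClientTuning {
    public static void main(String[] args) {
        // okhttp's Dispatcher caps how many async calls run concurrently;
        // maxRequests defaults to 64.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(16);        // hypothetical lower cap
        dispatcher.setMaxRequestsPerHost(16); // Datadog is a single host

        OkHttpClient client = new OkHttpClient.Builder()
                .dispatcher(dispatcher)
                .connectTimeout(3, TimeUnit.SECONDS)
                .writeTimeout(3, TimeUnit.SECONDS) // fail fast instead of piling up
                .build();

        System.out.println("maxRequests = " + client.dispatcher().getMaxRequests());
    }
}
```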
On 1/27/2021 5:52 AM, Xingcan Cui wrote:
Hi all,
Recently, I tried to use the Datadog reporter to collect some
user-defined metrics. Sometimes, when traffic peaks (which are also
peaks for metrics), the HTTP client throws the following exception:
```
[OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
    at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
    at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
I guess this may be caused by rate limiting on the Datadog server
side, since too many HTTP requests look like a kind of "attack". The
real problem is that after the above exceptions are thrown, the JVM
heap size of the taskmanager starts to increase and finally causes an
OOM. I'm curious whether this is caused by metrics accumulation,
i.e., for some reason the client can't reconnect to the Datadog
server and send the metrics, so the metrics data is buffered in
memory and causes the OOM.
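One way to check that hypothesis, if the reporter's OkHttpClient were accessible (it is not exposed today, so the snippet below is purely illustrative), would be to watch the dispatcher's running/queued call counts during a peak; a queued count that keeps climbing while requests time out would match the growing-heap pattern:
```
import okhttp3.OkHttpClient;

public class DispatcherQueueProbe {
    // Hypothetical helper: logs how many HTTP calls are running vs. waiting
    // in okhttp's dispatcher queue. Queued async calls are held in memory
    // until a slot frees up, so a steadily growing "queued" number would
    // support the metrics-accumulation explanation.
    static void logDispatcherState(OkHttpClient client) {
        int running = client.dispatcher().runningCallsCount();
        int queued = client.dispatcher().queuedCallsCount();
        System.out.printf("datadog http calls: running=%d, queued=%d%n", running, queued);
    }

    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient(); // stand-in for the reporter's client
        logDispatcherState(client);
    }
}
```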
I'm running Flink 1.11.2 on EMR-6.2.0 with
flink-metrics-datadog-1.11.2.jar.
Thanks,
Xingcan