(setting this field is currently not possible from a Flink user
perspective; it is something I will investigate)
On 1/27/2021 10:30 AM, Chesnay Schepler wrote:
Yes, I can see how the memory issue could occur.
However, it should be limited to buffering 64 requests; that is the
default limit that okhttp imposes on concurrent calls.
Maybe lowering this value already does the trick.
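For reference, that cap lives on okhttp's Dispatcher (setMaxRequests, default 64). Below is a minimal sketch of what lowering it could look like, assuming the reporter allowed a custom client to be passed in (it currently does not, as noted above); the concrete values and class name are placeholders, not the reporter's actual code:
```
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;
import java.util.concurrent.TimeUnit;

public class DatadogClientTuning {
    public static void main(String[] args) {
        // okhttp's Dispatcher caps how many async calls run concurrently;
        // maxRequests defaults to 64.
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(16);        // hypothetical lower cap
        dispatcher.setMaxRequestsPerHost(16); // Datadog is a single host

        OkHttpClient client = new OkHttpClient.Builder()
                .dispatcher(dispatcher)
                .connectTimeout(3, TimeUnit.SECONDS)
                .writeTimeout(3, TimeUnit.SECONDS) // fail fast instead of piling up
                .build();

        System.out.println("maxRequests = " + client.dispatcher().getMaxRequests());
    }
}
```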
On 1/27/2021 5:52 AM, Xingcan Cui wrote:
Hi all,
Recently, I tried to use the Datadog reporter to collect some
user-defined metrics. Sometimes, when traffic peaks (which are also
peaks for metrics), the HTTP client throws the following exception:
```
[OkHttp https://app.datadoghq.com/...] WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
    at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
    at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
    at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
I guess this may be caused by rate limiting on the Datadog server
side, since too many HTTP requests look like a kind of "attack". The
real problem is that after the above exceptions are thrown, the JVM
heap size of the taskmanager starts to increase and finally causes an
OOM. I'm curious whether this is caused by metrics accumulation,
i.e., for some reason the client can't reconnect to the Datadog
server and send the metrics, so the metrics data is buffered in
memory and causes the OOM.
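One way to check that hypothesis, if the reporter's OkHttpClient were accessible (it is not exposed today, so the snippet below is purely illustrative), would be to watch the dispatcher's running/queued call counts during a peak; a queued count that keeps climbing while requests time out would match the growing-heap pattern:
```
import okhttp3.OkHttpClient;

public class DispatcherQueueProbe {
    // Hypothetical helper: logs how many HTTP calls are running vs. waiting
    // in okhttp's dispatcher queue. Queued async calls are held in memory
    // until a slot frees up, so a steadily growing "queued" number would
    // support the metrics-accumulation explanation.
    static void logDispatcherState(OkHttpClient client) {
        int running = client.dispatcher().runningCallsCount();
        int queued = client.dispatcher().queuedCallsCount();
        System.out.printf("datadog http calls: running=%d, queued=%d%n", running, queued);
    }

    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient(); // stand-in for the reporter's client
        logDispatcherState(client);
    }
}
```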
I'm running Flink 1.11.2 on EMR-6.2.0 with
flink-metrics-datadog-1.11.2.jar.
Thanks,
Xingcan