Re: [External Sender] Re: ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue

2020-12-08 Thread Kye Bae
Hello, Piotr.

Thank you.

This is an error logged on the taskmanager just before it became "lost" to
the jobmanager (i.e., reported as "lost" in the jobmanager log just before
the job restart). In what context would this particular error (not the true
root cause you referred to) be thrown from a taskmanager? E.g., at any point
in the pipeline that involves communicating with non-collocated tasks
running on other taskmanagers? Or with the jobmanager?

-K

On Tue, Dec 8, 2020 at 3:19 AM Piotr Nowojski  wrote:

> Hi Kye,
>
> This error is almost certainly not the primary cause of the failure. It
> means that the node reporting it has detected some fatal failure on the
> other side of the wire (connection reset by peer), but the original error
> is somehow too slow or unable to propagate to the JobManager before this
> secondary exception. Something else must have failed or crashed first, so
> you should look for that something. This something can be:
> 1. TaskManager on the other end has crashed with some error - please look
> for some errors or warning in other task managers logs
> 2. OOM or some other JVM failure - again please look at the logs on other
> machines (maybe system logs)
> 3. Some OS failure - please look at the system logs on other machines
> 4. Some hardware failure (restart / crash)
> 5. Network problems
>
> Piotrek
>
> pon., 7 gru 2020 o 23:31 Kye Bae  napisał(a):
>
>> I forgot to mention: this is Flink 1.10.
>>
>> -K
>>
>> On Mon, Dec 7, 2020 at 5:08 PM Kye Bae  wrote:
>>
>>> Hello!
>>>
>>> We have a real-time streaming workflow that has been running for about
>>> 2.5 weeks.
>>>
>>> Then, we began to get the exception below from taskmanagers (random)
>>> since yesterday, and the job began to fail/restart every hour or so.
>>>
>>> The job does recover after each restart, but sometimes it takes more
>>> time to recover than allowed in our environment. On a few occasions, it
>>> took more than a few restarts to fully recover.
>>>
>>> Can you provide some insight into what this error means and also what we
>>> can do to prevent this in future?
>>>
>>> Thank you!
>>>
>>> +++
>>> ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue  -
>>> Encountered error while consuming partitions
>>> java.io.IOException: Connection reset by peer
>>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>> at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
>>> at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
>>> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>> at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> at java.lang.Thread.run(Thread.java:748)

Re: ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue

2020-12-07 Thread Kye Bae
I forgot to mention: this is Flink 1.10.

-K

On Mon, Dec 7, 2020 at 5:08 PM Kye Bae  wrote:

> Hello!
>
> We have a real-time streaming workflow that has been running for about 2.5
> weeks.
>
> Then, we began to get the exception below from taskmanagers (random) since
> yesterday, and the job began to fail/restart every hour or so.
>
> The job does recover after each restart, but sometimes it takes more time
> to recover than allowed in our environment. On a few occasions, it took
> more than a few restarts to fully recover.
>
> Can you provide some insight into what this error means and also what we
> can do to prevent this in future?
>
> Thank you!
>
> +++
> ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue  -
> Encountered error while consuming partitions
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
> at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
> at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
> at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at java.lang.Thread.run(Thread.java:748)
>


ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue

2020-12-07 Thread Kye Bae
Hello!

We have a real-time streaming workflow that has been running for about 2.5
weeks.

Then, since yesterday, we began to get the exception below from (random)
taskmanagers, and the job began to fail/restart every hour or so.

The job does recover after each restart, but sometimes it takes more time
to recover than allowed in our environment. On a few occasions, it took
more than a few restarts to fully recover.

Can you provide some insight into what this error means and also what we
can do to prevent this in future?

Thank you!

+++
ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue  -
Encountered error while consuming partitions
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1140)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)


Re: [External Sender] Re: Random Task executor shutdown (java.lang.OutOfMemoryError: Metaspace)

2020-11-17 Thread Kye Bae
It is possible, but I am not entirely sure whether the load order affects
the metaspace usage.

To find out why your taskmanager container is exceeding the metaspace, we
would need to know what value the max metaspace size is set to and then
find out how much of the metaspace is actually being used and what the
classloading stats are (before Yarn kills the container).

If possible, please share the current metaspace setting and the native
memory stats from one of your taskmanager instances (Java process as a Yarn
container). I think those run as the yarn user as opposed to the hadoop
user (it depends on your environment).

# enable the native memory tracking referenced earlier - this may need to
# be passed in as a taskmanager JVM option, not a jobmanager one
# log onto one of the taskmanager HW instances
sudo su - yarn # or log on as the user that runs Yarn
jps -v # grab the PID for one of the processes named YarnTaskExecutorRunner
- this would be a taskmanager
jcmd TM_PID VM.native_memory summary # size information for "Class" -
metaspace
jcmd TM_PID VM.classloader_stats # how many classes were loaded by which
classloader, sizes, etc.
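
For completeness, a minimal end-to-end sketch of the above, assuming a YARN
setup where you can edit flink-conf.yaml and restart the session, and a single
taskmanager per host (the config key is the standard way to pass extra JVM
flags to taskmanagers; the grep/awk one-liner is only a convenience):

# in flink-conf.yaml, before restarting the cluster:
#   env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

sudo su - yarn                              # or whichever user runs the containers
TM_PID=$(jps | grep YarnTaskExecutorRunner | awk '{print $1}')
jcmd "$TM_PID" VM.native_memory summary     # the "Class" section covers the metaspace
jcmd "$TM_PID" VM.classloader_stats         # classes and bytes per classloader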

There is a gentleman who blogs about Flink, and he goes into a bit more
detail in the post below. It was written against Flink 1.9, but the
foundational concepts still apply to 1.10.
http://cloudsqale.com/2020/04/29/flink-1-9-off-heap-memory-on-yarn-troubleshooting-container-is-running-beyond-physical-memory-limits-errors/

-K

On Tue, Nov 17, 2020 at 12:47 PM Flavio Pompermaier 
wrote:

> Another big potential candidate is the fact that the JDBC libs I use in my
> job are put into the Flink lib folder instead of into the fat jar..tomorrow
> I'll try to see if the metaspace is getting cleared correctly after that
> change.
> Unfortunately our jobs were written before the child-first / parent-first
> classloading refactoring, and at that time that was the way to go..but now
> it can cause this kind of problem when using the child-first policy.
>
> On Mon, Nov 16, 2020 at 8:44 PM Flavio Pompermaier 
> wrote:
>
>> Thank you Kye for your insights...in my mind, if the job runs without
>> problems one or more times, the heap size, and thus the metadata size, is
>> big enough and I should not increase it (on the same data, of course).
>> So I'll try to understand who is leaking what..the advice to avoid
>> dynamic class loading is just a workaround to me..there's something wrong
>> going on, and tomorrow I'll try to understand the root cause of the
>> problem using -XX:NativeMemoryTracking=summary as you suggested.
>>
>> I'll keep you up to date with my findings..
>>
>> Best,
>> Flavio
>>
>> On Mon, Nov 16, 2020 at 8:22 PM Kye Bae  wrote:
>>
>>> Hello!
>>>
>>> The JVM metaspace is where all the classes (not class instances or
>>> objects) get loaded. jmap -histo is going to show you the heap space
>>> usage, not the metaspace.
>>>
>>> You could inspect what is happening in the metaspace by using jcmd (e.g.,
>>> jcmd JPID VM.native_memory summary) after restarting the cluster with
>>> -XX:NativeMemoryTracking=summary.
>>>
>>> As the error message suggests, you may need to increase
>>> taskmanager.memory.jvm-metaspace.size, but you need to be slightly
>>> careful when specifying the memory parameters in flink-conf.yaml in Flink
>>> 1.10 (there is an issue with a confusing error message).
>>>
>>> Another thing to keep in mind is that you may want to avoid using
>>> dynamic classloading (
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code):
>>> when the job continuously fails for some temporary issue, it will load
>>> the same class files into the metaspace multiple times and could exceed
>>> whatever limit you set.
>>>
>>> -K
>>>
>>> On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský  wrote:
>>>
>>>> The exclusions should not have any impact on that, because what defines
>>>> which classloader will load which class is not the presence of a
>>>> particular class in a specific jar, but the configuration of the
>>>> parent-first-patterns [1].
>>>>
>>>> If you don't use any flink internal imports, then it still might be the
>>>> case that a class in one of the packages defined by the
>>>> parent-first-patterns holds a reference to your user-code classes, which
>>>> would cause the leak.

Re: [External Sender] Re: Random Task executor shutdown (java.lang.OutOfMemoryError: Metaspace)

2020-11-16 Thread Kye Bae
Hello!

The JVM metaspace is where all the classes (not class instances or objects)
get loaded. jmap -histo is going to show you the heap space usage, not the
metaspace.

You could inspect what is happening in the metaspace by using jcmd (e.g.,
jcmd JPID VM.native_memory summary) after restarting the cluster with
-XX:NativeMemoryTracking=summary.

As the error message suggests, you may need to increase
taskmanager.memory.jvm-metaspace.size, but you need to be slightly careful
when specifying the memory parameters in flink-conf.yaml in Flink 1.10
(there is an issue with a confusing error message).
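
For reference, a minimal sketch of what the relevant flink-conf.yaml lines
could look like in 1.10; the values are illustrative assumptions, not
recommendations, so adjust them to your own memory budget:

taskmanager.memory.process.size: 4096m         # total taskmanager process memory (example value)
taskmanager.memory.jvm-metaspace.size: 512m    # raise only if the metaspace is genuinely exhausted

In 1.10 it is usually simplest to set taskmanager.memory.process.size (or
taskmanager.memory.flink.size) together with the metaspace option, rather
than mixing many fine-grained memory keys.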

Another thing to keep in mind is that you may want to avoid using dynamic
classloading (
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code):
when the job continuously fails for some temporary issue, it will load the
same class files into the metaspace multiple times and could exceed
whatever limit you set.

-K

On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský  wrote:

> The exclusions should not have any impact on that, because what defines
> which classloader will load which class is not the presence of a particular
> class in a specific jar, but the configuration of the parent-first-patterns
> [1].
>
> If you don't use any flink internal imports, then it still might be the
> case that a class in one of the packages defined by the parent-first-patterns
> holds a reference to your user-code classes, which would cause the leak. I'd
> recommend inspecting the heap dump after several restarts of the application
> and looking for references to Class objects from the root set.
>
> Jan
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading
> 
> On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
>
> I've tried to remove all possible imports of classes not contained in the
> fat jar but I still face the same problem.
> I've also tried to reduce as much as possible the excludes in the shade
> section of the maven plugin (I took the one at [1]), so now I exclude only
> a few dependencies..could it be that I should include org.slf4j:* if I use
> a static import of it?
>
> <artifactSet>
>   <excludes>
>     <exclude>com.google.code.findbugs:jsr305</exclude>
>     <exclude>org.slf4j:*</exclude>
>     <exclude>log4j:*</exclude>
>   </excludes>
> </artifactSet>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies
> 
>
> On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský  wrote:
>
>> Yes, that could definitely cause this. You should probably avoid using
>> these flink-internal shaded classes and ship your own versions (not shaded).
>>
>> Best,
>>
>>  Jan
>> On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
>>
>> Thank you Jan for your valuable feedback.
>> Could it be that I should not import shaded Jackson classes in my
>> user code?
>> For example, import
>> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
>>
>> Best,
>> Flavio
>>
>> On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský  wrote:
>>
>>> Hi Flavio,
>>>
>>> when I encountered a problem quite similar to the one you describe, it was
>>> related to static storage located in a class that was loaded
>>> "parent-first". In my case it was in java.lang.ClassValue, but it might
>>> (and probably will be) different in your case. The problem is that if
>>> user code registers something in some (static) storage located in a class
>>> loaded with the parent (TaskTracker) classloader, then its associated
>>> classes will never be GC'd and the metaspace will grow. A good starting
>>> point would be not to focus on the biggest consumers of heap (in general),
>>> but to look at where the 15k objects of type Class are referenced from.
>>> That might help you figure this out. I'm not sure if there is something
>>> that can be done in general to prevent this type of leak. That would
>>> probably be a question for the dev@ mailing list.
>>>
>>> Best,
>>>
>>>  Jan
>>> On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
>>>
>>> Hello everybody,
>>> I was writing this email when a similar thread on this mailing list
>>> appeared..
>>> The difference is that the other problem seems to be related to Flink
>>> 1.10 on YARN and does not output anything helpful for debugging the cause
>>> of the problem.
>>>
>>> Indeed, in my use case I use Flink 1.11.0 and Flink on a standalone
>>> session cluster (the job is submitted to the cluster using the CLI client).
>>> The problem arises when I submit the same job for about 20 

Re: [External Sender] Debugging "Container is running beyond physical memory limits" on YARN for a long running streaming job

2020-09-25 Thread Kye Bae
Not sure about Flink 1.10.x. Can share a few things up to Flink 1.9.x:

1. If your Flink cluster runs only one job, avoid using the dynamic
classloader for your job: start it from one of the Flink class paths (see
the sketch below). As of Flink 1.9.x, using the dynamic classloader results
in the same classes getting loaded every time the job restarts (self-recovery
or otherwise), and it can eat up all the JVM "off-heap" memory. Yarn seems to
immediately kill the container when that happens.

2. Be sure to leave enough for the JVM "off-heap" area: GC + code cache +
thread stacks + other Java internal resources end up there.
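
A rough sketch of what these two suggestions can look like in practice; the
jar name, install path, and config values below are illustrative assumptions,
not drop-in settings (the memory keys shown are the Flink 1.10+ names):

# 1) Single-job cluster: put the job jar on the Flink classpath instead of
#    submitting it through the dynamic user-code classloader, e.g.
cp my-streaming-job.jar /opt/flink/lib/     # hypothetical jar name and Flink install path

# 2) Leave headroom for the JVM "off-heap" areas in flink-conf.yaml, e.g.
#    taskmanager.memory.jvm-metaspace.size: 512m
#    taskmanager.memory.jvm-overhead.max: 2g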

-K

On Sat, Sep 19, 2020 at 12:09 PM Shubham Kumar 
wrote:

> Hey everyone,
>
> We deployed a streaming job using Flink 1.10.1 a month back, and now we
> are frequently seeing YARN containers killed due to memory issues. I am
> trying to figure out the root cause of this issue in order to fix it.
>
> We have a streaming job whose basic structure looks like this:
> - Read 6 kafka streams and combine stats from them (union) to form a
> single stream
> - stream.keyBy(MyKey)
>  .window(TumblingEventTimeWindows.of(Time.minutes(1)))
>  .reduce(MyReduceFunction)
>  .addSink(new FlinkKafkaProducer011<>...);
>
> We are using RocksDB as state backend. In flink-conf.yaml, we used
> taskmanager.memory.process.size = 10GB with a parallelism of 12 and only
> one slot per task manager.
>
> So, a taskmanager process gets started with the following memory
> components as indicated in logs:
>
> TaskExecutor container... will be started on ... with
>> TaskExecutorProcessSpec {cpuCores=1.0, frameworkHeapSize=128.000mb (
>> 134217728 bytes), frameworkOffHeapSize=128.000mb (134217728 bytes),
>> taskHeapSize=4.125gb (4429184954 bytes), taskOffHeapSize=0 bytes,
>> networkMemSize=896.000mb (939524110 bytes), managedMemorySize=3.500gb (
>> 3758096440 bytes), jvmMetaspaceSize=256.000mb (268435456 bytes),
>> jvmOverheadSize=1024.000mb (1073741824 bytes)}.
>>
>
>>
>
>  which are as per defaults.
>
> Now, after 25 days we started encountering the following yarn container
> kill error:
>
>> Association with remote system [akka.tcp://flink@...] has failed,
>> address is now gated for [50] ms. Reason: [Association failed with
>> [akka.tcp://flink@...]] Caused by: [java.net.ConnectException: Connection
>> refused: .../...:37679]
>> 2020-09-09 00:53:24 INFO Closing TaskExecutor connection
>> container_e193_1592804717489_149347_01_11 because: [2020-09-09 00:53:
>> 21.417]Container 
>> [pid=44371,containerID=container_e193_1592804717489_149347_01_11]
>> is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB
>> physical memory used; 14.4 GB of 25.2 GB virtual memory used. Killing
>> container.
>>
>
> Yarn container size is 12GB as it is only allowed as a multiple of 3 GB
> (as per our settings).
>
> Now, when YARN reallocates a new container, the program starts again
> (without any issues) and after a few hours another container is killed with
> the same error, and the cycle repeats.
> At this point, I want to debug it as a running process without changing or
> playing around with various memory config options, as I don't want to wait
> ~1 month just to reproduce the error.
>
> I have tried to figure out something from Graphite metrics (see
> attachments):
> [1]: JVM Heap Memory (First 25 days) -> The memory goes up and, after
> reaching a point, goes down and then starts going up again. (No container
> kills were encountered until 09/09/2020; the program started on 14/08/2020.)
> [2]: JVM Heap Memory (Recent) -> The memory is still going up, but it seems
> it doesn't even reach its peak; instead the container is killed before
> that (within a few hours).
>
> From [1] and [2], I think the JVM heap memory should not keep rising, but
> if JVM heap memory were the issue causing the container kills, that would
> not explain the kill in case [2].
>
> [3]: Direct Memory and Off-heap Memory -> I don't think this is causing
> the issue, as most of the network buffers are free and off-heap memory is
> well below the threshold.
>
> At this point I thought RocksDB might be the culprit. I am aware that it
> uses the managed memory limit (I haven't changed any default config), which
> is completely off-heap. But when I look at the RocksDB size maintained at
> this location:
>
>
>> /data_4/yarn-nm-local-dir/usercache/root/appcache/application_.../flink-io-a48d1127-58a1-41c5-a5f0-32c5180fe74d/job_0bff1881431b5774c3b496a98febed1a_op_WindowOperator_4061fbe16fb95459a1a8d207644e2e63__4_12__uuid_9fe0b2ff-24bc-4301-8044-3fe8e1b3a3a0/db/
>
>
> It is only 17MB, which doesn't seem like much. I also took a heap dump
> of the org.apache.flink.yarn.YarnTaskExecutorRunner process, but it shows
> only 30MB of data being used (not sure what I am missing here, as it doesn't
> match