
Re: JVM Non Heap Memory

2016-12-05 Thread Chesnay Schepler

Hey Daniel,

the fix won't make it into 1.1.4, since it is only relevant if you're
using Flink Meters together with either the Graphite or Ganglia
reporter. The Meter metric, however, is not available in 1.1 at all,
so it can't be the underlying cause.

My fix is only for 1.2; the issue it fixes could have caused this
behavior there.

Now, for clarification: the "metrics-meter-tick-thread-X" threads are
not created by Flink. With Meters out of the picture, I therefore think
this is not an issue with Flink's metric system.

Instead, I believe Kafka may be the culprit; I found a similar
description here:

https://issues.apache.org/jira/browse/KAFKA-1521

Which Kafka version are you using? Kafka internally also uses the
DropWizard library, and a particular version (2.2.0) of that is
apparently known to leak threads.
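
To illustrate what a leak of this kind looks like, here is a minimal sketch using only the JDK. It is not the DropWizard, Kafka, or Flink code; the class `TickingComponent` and the thread name prefix `demo-tick-thread-` are hypothetical. Each instance owns a scheduler thread that stays parked until the owner shuts the scheduler down, so instances that are dropped without `close()` leave their threads behind.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a component that ticks on its own scheduler thread. */
class TickingComponent implements AutoCloseable {
    // Illustrative prefix only; not the name used by any real library.
    static final String THREAD_PREFIX = "demo-tick-thread-";

    private final ScheduledExecutorService scheduler;
    private volatile Thread worker;

    TickingComponent(int id) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, THREAD_PREFIX + id);
            t.setDaemon(true); // lets the demo JVM exit; the thread still lives until shutdown
            worker = t;
            return t;
        });
        // Scheduling a task starts the worker thread, which then parks between ticks.
        scheduler.scheduleAtFixedRate(() -> { }, 0, 5, TimeUnit.SECONDS);
    }

    /** Without this call the scheduler thread is never released. */
    @Override
    public void close() {
        scheduler.shutdownNow();
        Thread t = worker;
        try {
            if (t != null) {
                t.join(5_000); // wait for the worker to actually exit
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Counts live threads whose name starts with the given prefix. */
    static long countLiveThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }
}
```

Creating a few instances and calling `TickingComponent.countLiveThreads(TickingComponent.THREAD_PREFIX)` before and after `close()` shows the thread count rising with every instance and dropping only once the schedulers are shut down.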

Regards,
Chesnay

On 05.12.2016 17:30, Ufuk Celebi wrote:

Quick question: since the Meter issue does _not_ apply to 1.1.3, which Flink
metrics are you using?

– Ufuk

On 5 December 2016 at 16:44:47, Daniel Santos (dsan...@cryptolab.net) wrote:

Hello,

Thank you all for the kind replies.

I've got the general idea. I am using Flink version 1.1.3.

So it seems the Meter fix won't make it into 1.1.4?

Best Regards,

Daniel Santos

On 12/05/2016 01:28 PM, Chesnay Schepler wrote:

We don't have to include it in 1.1.4 since Meters do not exist in
1.1; my bad for tagging it in JIRA for 1.1.4.

On 05.12.2016 14:18, Ufuk Celebi wrote:

Just to note that the bug mentioned by Chesnay does not invalidate
Stefan's comments. ;-)

Chesnay's issue is here:
https://issues.apache.org/jira/browse/FLINK-5261

I added an issue to improve the documentation about cancellation
(https://issues.apache.org/jira/browse/FLINK-5260).

Which version of Flink are you using? Chesnay's fix will make it into
the upcoming 1.1.4 release.

On 5 December 2016 at 14:04:49, Chesnay Schepler (ches...@apache.org)
wrote:

Hello Daniel,
I'm afraid you stumbled upon a bug in Flink. Meters were not properly
cleaned up, causing the underlying dropwizard meter update threads to
not be shut down either.
I've opened a JIRA and will open a PR soon.
Thank you for reporting this issue.
Regards,
Chesnay
On 05.12.2016 12:05, Stefan Richter wrote:

Hi Daniel,

the behaviour you observe looks like some threads are not being
canceled. Thread cancelation in Flink (and in Java in general) is
always cooperative, meaning that the thread you want to cancel has to
check for cancelation and react to it. Sometimes this also requires
some effort from the client that wants to cancel a thread. So if you
implement, e.g., custom operators or functions with aerospike, you must
ensure that they a) react to cancelation and b) clean up their
resources. If you do not, your aerospike client might stay in a
blocking call forever; blocking IO calls in particular are prone to
this. What you need to ensure is that cancelation from the clients
includes closing IO resources such as streams, to unblock the thread
and allow it to terminate. This means that your code must (to a certain
degree) actively participate in Flink's task lifecycle. In Flink 1.2 we
introduce a feature called CloseableRegistry, which makes participating
in this lifecycle easier w.r.t. closing resources. For the time being,
you should check that Flink’s task cancelation also causes your code to
close the aerospike client and check cancelation flags.
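
The cooperative pattern described above can be sketched with plain JDK classes. This is only an illustration under stated assumptions: `CancellableReader` is a hypothetical name, a `java.nio.channels.Pipe` stands in for the blocking client connection, and neither Flink's task lifecycle nor CloseableRegistry is used. The worker checks a flag (a), and `cancel()` additionally closes the channel, because the flag alone cannot wake a thread that is blocked in a read; every exit path releases the resource (b).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

/** Hypothetical worker that blocks on IO and supports cooperative cancellation. */
class CancellableReader implements Runnable {
    private final Pipe pipe;
    private volatile boolean running = true;

    CancellableReader() {
        try {
            this.pipe = Pipe.open(); // stands in for a blocking client connection
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void run() {
        ByteBuffer buf = ByteBuffer.allocate(64);
        try {
            while (running) {                       // a) check the cancellation flag
                buf.clear();
                if (pipe.source().read(buf) < 0) {  // blocks until data, EOF, or close
                    break;
                }
            }
        } catch (IOException e) {
            // Expected after cancel(): closing the channel aborts the blocked read.
        } finally {
            closeQuietly();                         // b) clean up on every exit path
        }
    }

    /** Called from another thread, analogous to a cancel hook. */
    void cancel() {
        running = false; // not enough on its own while the worker is blocked in read()
        closeQuietly();  // closing the channel unblocks the worker
    }

    private void closeQuietly() {
        try {
            pipe.source().close();
            pipe.sink().close();
        } catch (IOException ignored) {
            // nothing useful to do during shutdown
        }
    }
}
```

Running the reader on its own thread and calling `cancel()` from another thread lets the reader thread terminate; with only the flag and no close, it would stay blocked in `read()` indefinitely.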

Best,
Stefan

On 05.12.2016 at 11:42, Daniel Santos wrote:

Hello,

I have checked the threads and taken some dumps, and I have disabled
checkpointing.

Here are my findings.

I did a thread dump a few hours after I booted up the whole cluster.
(on 2/12/2016; 5 TMs; 3 GB heap each; 7 GB total limit each)

The dump shows that most threads come from 3 sources.

* OutputFlusher --- 634 -- Sleeping State

"OutputFlusher" - Thread t@4758
java.lang.Thread.State: TIMED_WAITING
at java.lang.Thread.sleep(Native Method)
at org.apache.flink.streaming.runtime.io.StreamRecordWriter$OutputFlusher.run(StreamRecordWriter.java:164)

Locked ownable synchronizers:
- None

* Metrics --- 376 (the Flink Metrics Reporter is the only metrics in use) -- Parked State

"metrics-meter-tick-thread-1" - Thread t@29024
java.lang.Thread.State: TIMED_WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
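
Per-source counts like the ones in this dump (634 OutputFlusher threads, 376 metrics threads) can also be gathered from inside a JVM. The following is a minimal sketch, not a Flink utility: `ThreadCensus` is a hypothetical helper, and grouping by thread name with a trailing number stripped is an assumption about how the threads are named.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that groups the JVM's live threads by name. */
class ThreadCensus {
    /** Strips a trailing "-<digits>" so numbered threads fall into one group. */
    static String groupOf(String threadName) {
        return threadName.replaceAll("-?\\d+$", "");
    }

    /** Returns the number of live threads per name group, sorted by group name. */
    static Map<String, Integer> countByGroup() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupOf(t.getName()), 1, Integer::sum);
        }
        return counts;
    }
}
```

Printing `ThreadCensus.countByGroup()` periodically, for example from a debugging hook, would show whether a group such as `metrics-meter-tick-thread` keeps growing over time.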

Re: JVM Non Heap Memory

2016-12-05 Thread Ufuk Celebi

Quick question: since the Meter issue does _not_ apply to 1.1.3, which Flink
metrics are you using?

– Ufuk

On 5 December 2016 at 16:44:47, Daniel Santos (dsan...@cryptolab.net) wrote:
> Hello,
>  
> Thank you all for the kind replies.
>
> I've got the general idea. I am using Flink version 1.1.3.
>
> So it seems the Meter fix won't make it into 1.1.4?
>  
> Best Regards,
>  
> Daniel Santos
>  
>  
> On 12/05/2016 01:28 PM, Chesnay Schepler wrote:
> > We don't have to include it in 1.1.4 since Meters do not exist in
> > 1.1; my bad for tagging it in JIRA for 1.1.4.
> >
> > On 05.12.2016 14:18, Ufuk Celebi wrote:
> >> Just to note that the bug mentioned by Chesnay does not invalidate
> >> Stefan's comments. ;-)
> >>
> >> Chesnay's issue is here:
> >> https://issues.apache.org/jira/browse/FLINK-5261
> >>
> >> I added an issue to improve the documentation about cancellation
> >> (https://issues.apache.org/jira/browse/FLINK-5260).
> >>
> >> Which version of Flink are you using? Chesnay's fix will make it into
> >> the upcoming 1.1.4 release.
> >>
> >>
> >> On 5 December 2016 at 14:04:49, Chesnay Schepler (ches...@apache.org)
> >> wrote:
> >>> Hello Daniel,
> >>> I'm afraid you stumbled upon a bug in Flink. Meters were not properly
> >>> cleaned up, causing the underlying dropwizard meter update threads to
> >>> not be shut down either.
> >>> I've opened a JIRA and will open a PR soon.
> >>> Thank you for reporting this issue.
> >>> Regards,
> >>> Chesnay
> >>> On 05.12.2016 12:05, Stefan Richter wrote:
>  Hi Daniel,
> 
>  the behaviour you observe looks like some threads are not being
>  canceled. Thread cancelation in Flink (and in Java in general) is
>  always cooperative, meaning that the thread you want to cancel has to
>  check for cancelation and react to it. Sometimes this also requires
>  some effort from the client that wants to cancel a thread. So if you
>  implement, e.g., custom operators or functions with aerospike, you must
>  ensure that they a) react to cancelation and b) clean up their
>  resources. If you do not, your aerospike client might stay in a
>  blocking call forever; blocking IO calls in particular are prone to
>  this. What you need to ensure is that cancelation from the clients
>  includes closing IO resources such as streams, to unblock the thread
>  and allow it to terminate. This means that your code must (to a certain
>  degree) actively participate in Flink's task lifecycle. In Flink 1.2 we
>  introduce a feature called CloseableRegistry, which makes participating
>  in this lifecycle easier w.r.t. closing resources. For the time being,
>  you should check that Flink’s task cancelation also causes your code to
>  close the aerospike client and check cancelation flags.
> 
>  Best,
>  Stefan
> 
> > > On 05.12.2016 at 11:42, Daniel Santos wrote:
> >
> > Hello,
> >
> > I have checked the threads and taken some dumps, and I have disabled
> > checkpointing.
> >
> > Here are my findings.
> >
> > I did a thread dump a few hours after I booted up the whole cluster.
> > (on 2/12/2016; 5 TMs; 3 GB heap each; 7 GB total limit each)
> >
> > The dump shows that most threads come from 3 sources.
> >
> > * OutputFlusher --- 634 -- Sleeping State
> >
> > "OutputFlusher" - Thread t@4758
> > java.lang.Thread.State: TIMED_WAITING
> > at java.lang.Thread.sleep(Native Method)
> > at org.apache.flink.streaming.runtime.io.StreamRecordWriter$OutputFlusher.run(StreamRecordWriter.java:164)
> >
> > Locked ownable synchronizers:
> > - None
> >
> > * Metrics --- 376 (the Flink Metrics Reporter is the only metrics in use) -- Parked State
> >
> > "metrics-meter-tick-thread-1" - Thread t@29024
> > java.lang.Thread.State: TIMED_WAITING
> > at sun.misc.Unsafe.park(Native Method)
> > - parking to wait for (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> > at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
> > at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
> > at
> >

Re: JVM Non Heap Memory

2016-12-05 Thread Daniel Santos

Hello,

Thank you all for the kind replies.

I've got the general idea. I am using Flink version 1.1.3.

So it seems the Meter fix won't make it into 1.1.4?

Best Regards,

Daniel Santos

On 12/05/2016 01:28 PM, Chesnay Schepler wrote:
We don't have to include it in 1.1.4 since Meters do not exist in
1.1; my bad for tagging it in JIRA for 1.1.4.

On 05.12.2016 14:18, Ufuk Celebi wrote:
Just to note that the bug mentioned by Chesnay does not invalidate
Stefan's comments. ;-)