Re: Java Process Memory Leak

2014-03-18 Thread Craig Muchinsky
Hi Young,

You are correct, I didn't catch that you were using 1.0.0 during my first 
read. I submitted GIRAPH-871 for the Netty 4-specific problem I found 
against the 1.1.0-SNAPSHOT code.

Thanks,
Craig M.



From:   Young Han young@uwaterloo.ca
To: user@giraph.apache.org
Date:   03/17/2014 05:36 PM
Subject: Re: Java Process Memory Leak



Interesting find. It looks like that bit was added recently 
(https://reviews.apache.org/r/17644/diff/3/) and so was not part of Giraph 
1.0.0, as far as I can tell.

Also, if anyone cares, a clunky (Ubuntu) workaround I'm using is:

kill $(ps aux | grep [j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/ | awk '{print $2}')

Thanks,
Young



On Mon, Mar 17, 2014 at 6:10 PM, Craig Muchinsky cmuch...@us.ibm.com 
wrote:
I just noticed a similar problem myself. I did a thread dump and found 
similar netty client threads lingering. After poking around the source a 
bit, I'm wondering if the problem is related to this bit of code I found 
in the NettyClient.stop() method: 

workerGroup.shutdownGracefully();
ProgressableUtils.awaitTerminationFuture(executionGroup, context);
if (executionGroup != null) {
  executionGroup.shutdownGracefully();
  ProgressableUtils.awaitTerminationFuture(executionGroup, context);
}

Notice that the first await termination call seems to be waiting on the 
executionGroup instead of the workerGroup... 
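
If that is indeed the culprit, it would line up with the symptom: the first 
wait targets a group whose shutdown has not yet been requested, so it can 
block indefinitely, and Netty's event loop threads are non-daemon by 
default, which keeps the worker JVM alive. A minimal, self-contained sketch 
of that ordering problem against plain Netty 4 (nothing Giraph-specific; 
the class name and timeouts below are made up for illustration):

import io.netty.channel.nio.NioEventLoopGroup;
import java.util.concurrent.TimeUnit;

public class ShutdownOrderSketch {
  public static void main(String[] args) throws Exception {
    NioEventLoopGroup group = new NioEventLoopGroup(2);
    group.submit(() -> { }).sync();  // force the event loop threads to start

    // Waiting for termination of a group that has not been asked to shut
    // down simply blocks until the timeout (or forever without one).
    boolean terminated = group.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("terminated before shutdown request? " + terminated);

    // Only after shutdownGracefully() does the wait complete and the
    // group's non-daemon threads exit, letting the JVM terminate.
    group.shutdownGracefully();
    group.awaitTermination(30, TimeUnit.SECONDS);
  }
}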

Craig M. 



From:    Young Han young@uwaterloo.ca
To:      user@giraph.apache.org
Date:    03/17/2014 03:25 PM
Subject: Re: Java Process Memory Leak




Oh, I see. I did jstack on a cluster of machines and a single machine... 
I'm not quite sure how to interpret the output. My best guess is that 
there might be a deadlock---there's just a bunch of Netty threads waiting. 
The links to the jstack dumps:

http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 
graph from SNAP)
http://pastebin.com/MNEUELui   (MST, from one of the 64 workers, com-orkut 
graph from SNAP)

Any idea what's happening? Or anything in particular I should look for 
next?

Thanks,
Young 


On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching ach...@apache.org wrote: 
Hi Young,

Our Hadoop instance (Corona) kills processes after they finish executing, 
so we don't see this.  You might want to do a jstack to see where it's 
hung up and figure out the issue.

Thanks

Avery 


On 3/17/14, 7:56 AM, Young Han wrote: 
Hi all,

With Giraph 1.0.0, I've noticed an issue where the Java process 
corresponding to the job loiters around indefinitely even after the job 
completes (successfully). The process consumes memory but not CPU time. 
This happens on both a single machine and clusters of machines (in which 
case every worker has the issue). The only way I know of fixing this is 
killing the Java process manually---restarting or stopping Hadoop does not 
help.

Is this some known bug or a configuration issue on my end?

Thanks,
Young 





Java Process Memory Leak

2014-03-17 Thread Young Han
Hi all,

With Giraph 1.0.0, I've noticed an issue where the Java process
corresponding to the job loiters around indefinitely even after the job
completes (successfully). The process consumes memory but not CPU time.
This happens on both a single machine and clusters of machines (in which
case every worker has the issue). The only way I know of fixing this is
killing the Java process manually---restarting or stopping Hadoop does not
help.

Is this some known bug or a configuration issue on my end?

Thanks,
Young


Re: Java Process Memory Leak

2014-03-17 Thread Avery Ching

Hi Young,

Our Hadoop instance (Corona) kills processes after they finish executing, 
so we don't see this.  You might want to do a jstack to see where it's 
hung up and figure out the issue.


Thanks

Avery

On 3/17/14, 7:56 AM, Young Han wrote:

Hi all,

With Giraph 1.0.0, I've noticed an issue where the Java process 
corresponding to the job loiters around indefinitely even after the 
job completes (successfully). The process consumes memory but not CPU 
time. This happens on both a single machine and clusters of machines 
(in which case every worker has the issue). The only way I know of 
fixing this is killing the Java process manually---restarting or 
stopping Hadoop does not help.


Is this some known bug or a configuration issue on my end?

Thanks,
Young




Re: Java Process Memory Leak

2014-03-17 Thread Young Han
Oh, I see. I did jstack on a cluster of machines and a single machine...
I'm not quite sure how to interpret the output. My best guess is that there
might be a deadlock---there's just a bunch of Netty threads waiting. The
links to the jstack dumps:

http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 graph
from SNAP)
http://pastebin.com/MNEUELui   (MST, from one of the 64 workers, com-orkut
graph from SNAP)

Any idea what's happening? Or anything in particular I should look for next?

Thanks,
Young


On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching ach...@apache.org wrote:

 Hi Young,

 Our Hadoop instance (Corona) kills processes after they finish executing,
 so we don't see this.  You might want to do a jstack to see where it's hung
 up and figure out the issue.

 Thanks

 Avery


 On 3/17/14, 7:56 AM, Young Han wrote:

 Hi all,

 With Giraph 1.0.0, I've noticed an issue where the Java process
 corresponding to the job loiters around indefinitely even after the job
 completes (successfully). The process consumes memory but not CPU time.
 This happens on both a single machine and clusters of machines (in which
 case every worker has the issue). The only way I know of fixing this is
 killing the Java process manually---restarting or stopping Hadoop does not
 help.

 Is this some known bug or a configuration issue on my end?

 Thanks,
 Young





Re: Java Process Memory Leak

2014-03-17 Thread Craig Muchinsky
I just noticed a similar problem myself. I did a thread dump and found 
similar netty client threads lingering. After poking around the source a 
bit, I'm wondering if the problem is related to this bit of code I found 
in the NettyClient.stop() method:

workerGroup.shutdownGracefully();
ProgressableUtils.awaitTerminationFuture(executionGroup, context);
if (executionGroup != null) {
  executionGroup.shutdownGracefully();
  ProgressableUtils.awaitTerminationFuture(executionGroup, context);
}

Notice that the first await termination call seems to be waiting on the 
executionGroup instead of the workerGroup...
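
For illustration, here is a sketch of what the sequence presumably should 
look like, with the first wait pointed at the group that was just shut down 
(a guess at the intent, not a tested patch):

workerGroup.shutdownGracefully();
// assumption: the only change needed is to wait on workerGroup here
ProgressableUtils.awaitTerminationFuture(workerGroup, context);
if (executionGroup != null) {
  executionGroup.shutdownGracefully();
  ProgressableUtils.awaitTerminationFuture(executionGroup, context);
}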

Craig M.



From:   Young Han young@uwaterloo.ca
To: user@giraph.apache.org
Date:   03/17/2014 03:25 PM
Subject: Re: Java Process Memory Leak



Oh, I see. I did jstack on a cluster of machines and a single machine... 
I'm not quite sure how to interpret the output. My best guess is that 
there might be a deadlock---there's just a bunch of Netty threads waiting. 
The links to the jstack dumps:

http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 
graph from SNAP)
http://pastebin.com/MNEUELui   (MST, from one of the 64 workers, com-orkut 
graph from SNAP)

Any idea what's happening? Or anything in particular I should look for 
next?

Thanks,
Young


On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching ach...@apache.org wrote:
Hi Young,

Our Hadoop instance (Corona) kills processes after they finish executing, 
so we don't see this.  You might want to do a jstack to see where it's 
hung up and figure out the issue.

Thanks

Avery


On 3/17/14, 7:56 AM, Young Han wrote:
Hi all,

With Giraph 1.0.0, I've noticed an issue where the Java process 
corresponding to the job loiters around indefinitely even after the job 
completes (successfully). The process consumes memory but not CPU time. 
This happens on both a single machine and clusters of machines (in which 
case every worker has the issue). The only way I know of fixing this is 
killing the Java process manually---restarting or stopping Hadoop does not 
help.

Is this some known bug or a configuration issue on my end?

Thanks,
Young




Re: Java Process Memory Leak

2014-03-17 Thread Young Han
Interesting find. It looks like that bit was added recently
(https://reviews.apache.org/r/17644/diff/3/) and so was not part of Giraph
1.0.0, as far as I can tell.

Also, if anyone cares, a clunky (Ubuntu) workaround I'm using is:

kill $(ps aux | grep [j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/ | awk '{print $2}')

Thanks,
Young



On Mon, Mar 17, 2014 at 6:10 PM, Craig Muchinsky cmuch...@us.ibm.com wrote:

 I just noticed a similar problem myself. I did a thread dump and found
 similar netty client threads lingering. After poking around the source a
 bit, I'm wondering if the problem is related to this bit of code I found in
 the NettyClient.stop() method:

 workerGroup.shutdownGracefully();
 ProgressableUtils.awaitTerminationFuture(executionGroup, context);
 if (executionGroup != null) {
   executionGroup.shutdownGracefully();
   ProgressableUtils.awaitTerminationFuture(executionGroup, context);
 }

 Notice that the first await termination call seems to be waiting on the
 executionGroup instead of the workerGroup...

 Craig M.



 From:    Young Han young@uwaterloo.ca
 To:      user@giraph.apache.org
 Date:    03/17/2014 03:25 PM
 Subject: Re: Java Process Memory Leak



 Oh, I see. I did jstack on a cluster of machines and a single machine...
 I'm not quite sure how to interpret the output. My best guess is that there
 might be a deadlock---there's just a bunch of Netty threads waiting. The
 links to the jstack dumps:

 http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 graph from SNAP)
 http://pastebin.com/MNEUELui (MST, from one of the 64 workers, com-orkut graph from SNAP)

 Any idea what's happening? Or anything in particular I should look for
 next?

 Thanks,
 Young


 On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching ach...@apache.org wrote:
 Hi Young,

 Our Hadoop instance (Corona) kills processes after they finish executing,
 so we don't see this.  You might want to do a jstack to see where it's hung
 up and figure out the issue.

 Thanks

 Avery


 On 3/17/14, 7:56 AM, Young Han wrote:
 Hi all,

 With Giraph 1.0.0, I've noticed an issue where the Java process
 corresponding to the job loiters around indefinitely even after the job
 completes (successfully). The process consumes memory but not CPU time.
 This happens on both a single machine and clusters of machines (in which
 case every worker has the issue). The only way I know of fixing this is
 killing the Java process manually---restarting or stopping Hadoop does not
 help.

 Is this some known bug or a configuration issue on my end?

 Thanks,
 Young