Re:Re: How to know the root reason to cause RegionServer OOM?

2015-05-21 Thread David chen
Thanks Ted and Stack. @Ted, I also reproduced the OOM for scenario #1 and found the hints in the .out file (in CDH it is /var/run/cloudera-scm-agent/process/*-hbase-REGIONSERVER/logs/stdout.log). @Stack, it is indeed not an HBase issue; it happened just because another application occupied more memory, then

Re:Re: How to know the root reason to cause RegionServer OOM?

2015-05-20 Thread David chen
Thanks Ted. Sorry, the log extension is .log.out, so I thought the .out file you mentioned was the log file. My version is HBase 0.98.6-cdh5.2.0; where is the regionserver.out file? BTW, I should confirm that my scenario is #2, so I look forward to your snippet from the .out file

Re: How to know the root reason to cause RegionServer OOM?

2015-05-20 Thread Ted Yu
For scenario #1, please check the regionserver.out file, not the log file. I was able to reproduce scenario #1 by giving the regionserver a 124mb heap. As soon as I put load on the server, it was killed by the kill -9 command. I can send you a snippet from the .out file in the morning. Cheers On May 20,

Re: How to know the root reason to cause RegionServer OOM?

2015-05-20 Thread David chen
Thanks Ted. For scenario #1, I cannot see any clue in the regionserver log file indicating that the kill -9 command was executed. Meanwhile, I think that when the JVM detects an OOME in the regionserver process, it creates a new thread to execute kill -9 %p; the new thread should not write to the regionserver log, so the

Re: How to know the root reason to cause RegionServer OOM?

2015-05-20 Thread Stack
On Wed, May 20, 2015 at 1:46 AM, David chen c77...@163.com wrote: Thanks Ted. For scenario #1, I cannot see any clue in the regionserver log file indicating that the kill -9 command was executed. Meanwhile, I think that when the JVM detects an OOME in the regionserver process, it creates a new thread to execute kill

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-19 Thread David chen
Thanks for the replies, they indeed helped me. Another question: I think there are two ways the RegionServer process can be killed: 1. When the JVM detects that the memory occupied by the RegionServer exceeds the max heap size, the JVM actively calls the command configured by option

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-19 Thread Ted Yu
For scenario #1, you would see in the regionserver.out file that the kill -9 command was applied due to an OOME. For scenario #2, can you see if dmesg provides some clue? Cheers On Tue, May 19, 2015 at 6:32 PM, David chen c77...@163.com wrote: Thanks for the replies, they indeed helped me. Another
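For scenario #2 (the kernel OOM killer rather than the JVM's own OnOutOfMemoryError hook), the dmesg or /var/log/messages check can be scripted. The following is only a rough sketch: the function name find_oom_kills and the regular expression are illustrative assumptions, and the two sample lines are hypothetical, shaped like typical kernel output rather than copied from this thread (only the PID 22827 comes from the thread). The exact wording of these kernel messages varies between kernel versions, so the pattern may need adjusting.

```python
import re

# Matches the kernel's "Killed process <pid>" line. The optional comma and
# the lazy gap before "(name)" are meant to cover both older and newer
# kernel message layouts (an assumption; verify against your own logs).
KILLED = re.compile(r"Killed process (\d+),?.*?\((?P<name>[^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) pairs for every OOM-killer kill found."""
    kills = []
    for line in log_lines:
        match = KILLED.search(line)
        if match:
            kills.append((int(match.group(1)), match.group("name")))
    return kills


# Hypothetical sample lines for illustration only, not taken from the thread.
SAMPLE = [
    "May 14 12:00:38 localhost kernel: Out of memory: Kill process 22827 (java) score 917 or sacrifice child",
    "May 14 12:00:38 localhost kernel: Killed process 22827, UID 496, (java) total-vm:20000000kB, anon-rss:16000000kB, file-rss:0kB",
]

if __name__ == "__main__":
    print(find_oom_kills(SAMPLE))
```

If this finds the RegionServer PID, the process was killed by the kernel because the whole machine ran short of memory, which would not show up in the regionserver .out file.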

Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-18 Thread Sean Busbey
On Mon, May 18, 2015 at 11:47 AM, Andrew Purtell apurt...@apache.org wrote: You need to not overcommit memory on servers running JVMs for HDFS and HBase (and YARN, and containers, if colocating Hadoop MR). Sum the -Xmx parameter, the maximum heap size, for all JVMs that will be concurrently

Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-18 Thread Andrew Purtell
You need to not overcommit memory on servers running JVMs for HDFS and HBase (and YARN, and containers, if colocating Hadoop MR). Sum the -Xmx parameter, the maximum heap size, for all JVMs that will be concurrently executing on the server. The total should be less than the total amount of RAM
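The sizing rule above is simple arithmetic, and a small sketch may make it concrete. The helper names to_bytes and heap_budget are hypothetical, and apart from the 16106127360-byte RegionServer heap quoted elsewhere in this thread, the heap sizes and the 32 GiB of RAM are made-up values. Note that a JVM also uses memory outside the heap (thread stacks, direct buffers, and so on), so passing this check is necessary rather than sufficient.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx (no suffix means bytes).
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(size):
    """Convert a JVM-style size such as '1000m' or '16106127360' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(match.group(1)) * UNITS[match.group(2).lower()]


def heap_budget(xmx_values, total_ram_bytes):
    """Return (sum of max heaps in bytes, True if the sum is below total RAM)."""
    total_heap = sum(to_bytes(value) for value in xmx_values)
    return total_heap, total_heap < total_ram_bytes


if __name__ == "__main__":
    # 16106127360 is the RegionServer heap quoted in this thread; the other
    # heaps and the 32 GiB of RAM are hypothetical values for illustration.
    heaps = ["16106127360", "4g", "1000m"]
    total, fits = heap_budget(heaps, to_bytes("32g"))
    print(f"sum of -Xmx: {total} bytes, fits in RAM: {fits}")
```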

Re:Re: How to know the root reason to cause RegionServer OOM?

2015-05-17 Thread David chen
The snippet in /var/log/messages is as follows; I am sure that the killed process (22827) is the RegionServer. .. May 14 12:00:38 localhost kernel: Mem-Info: May 14 12:00:38 localhost kernel: Node 0 DMA per-cpu: May 14 12:00:38 localhost kernel: CPU0: hi:0, btch: 1 usd: 0 .. May 14

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-15 Thread David chen
Hi Ted, I read the code you pointed to (HRegionServer#Scan in version 0.98.5), and it looks like a partial row is returned there. If the partial-row problem is already fixed in 0.98.5, why is the fix version 1.1.0 in HBASE-11544? At 2015-05-14 01:04:35, Ted Yu yuzhih...@gmail.com wrote:

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-15 Thread Ted Yu
I got '502 Bad Gateway' trying to access the post David mentioned. Here is the same article in case you get a 502 error: http://java.dzone.com/articles/OOM-relation-to-swappiness FYI On Thu, May 14, 2015 at 2:40 AM, David chen c77...@163.com wrote: Thanks for the help, guys. Maybe the root cause

Re: How to know the root reason to cause RegionServer OOM?

2015-05-15 Thread iain wright
What log is this seen in? Can you paste the log line? Do you mean /var/log/messages? On May 12, 2015 7:44 PM, David chen c77...@163.com wrote: A RegionServer was killed because of OutOfMemory (OOM). Although the killed process can be seen in the Linux message log, I still have the following two

Re: Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-15 Thread Ted Yu
I should have mentioned in my previous email that I was looking at the code in branch-1. bq. why the fix version is 1.1.0 in HBASE-11544? See the release note: Incompatible Change: The return type of InternalScanners#next and RegionScanners#nextRaw has been changed to NextState from boolean Cheers On Fri,

Re:Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-14 Thread David chen
Thanks for the help, guys. Maybe the root cause is that swap is turned off. The cluster contains seven region servers; all of them set vm.swappiness to 0, but two of them have always had swap turned off, while the others have it turned on. The OOMs have always occurred on those two machines. I plan to turn on swap and

Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-13 Thread Elliott Clark
On Wed, May 13, 2015 at 12:59 AM, David chen c77...@163.com wrote: -XX:MaxGCPauseMillis=6000 With this line you're basically telling Java to never garbage collect. Can you try lowering that to something closer to the JVM default and see if you have better stability?

Re: Re: How to know the root reason to cause RegionServer OOM?

2015-05-13 Thread Ted Yu
For #2, a partial row would be returned. Please take a look at the following method in RSRpcServices, around line 2393: public ScanResponse scan(final RpcController controller, final ScanRequest request) Cheers On Wed, May 13, 2015 at 12:59 AM, David chen c77...@163.com wrote: Thanks for your

Re: How to know the root reason to cause RegionServer OOM?

2015-05-13 Thread Stack
On Tue, May 12, 2015 at 7:41 PM, David chen c77...@163.com wrote: A RegionServer was killed because of OutOfMemory (OOM). Although the killed process can be seen in the Linux message log, I still have the following two questions: 1. How can I find the root cause of the OOM? Start the

Re: How to know the root reason to cause RegionServer OOM?

2015-05-13 Thread Bryan Beaudreault
After moving to the G1GC we were plagued with random OOMs from time to time. We always thought it was due to people requesting a big row or group of rows, but upon investigation noticed that the heap dumps were many GBs less than the max heap at time of OOM. If you have this symptom, you may be

Re:Re: How to know the root reason to cause RegionServer OOM?

2015-05-13 Thread David chen
Thanks for your reply. Yes, it does appear in the RegionServer command line, as follows: jps -v|grep Region HRegionServer -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m -Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360 -XX:+UseG1GC -XX:MaxGCPauseMillis=6000
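The command line above sets -Xmx twice (1000m and 16106127360), which can make it unclear which heap limit is in force. As a rough sketch, the settings can be pulled out of the jps -v output as shown below. The helper name effective_flags is hypothetical, ARGS copies only the flags visible in the snippet above, and the sketch assumes the usual HotSpot behavior that the last occurrence of a repeated flag takes effect. For comparison, the G1 default for MaxGCPauseMillis is 200 ms.

```python
import re


def effective_flags(jvm_args):
    """Pick the last -Xmx and -XX:MaxGCPauseMillis from a JVM argument string.

    Assumes the JVM honors the last occurrence of a repeated flag.
    """
    xmx = re.findall(r"-Xmx(\d+[kKmMgG]?)", jvm_args)
    pause = re.findall(r"-XX:MaxGCPauseMillis=(\d+)", jvm_args)
    return {
        "all_xmx": xmx,                      # every -Xmx value, in order
        "xmx": xmx[-1] if xmx else None,     # the one assumed to win
        "max_gc_pause_ms": int(pause[-1]) if pause else None,
    }


# Only the flags visible in the thread snippet; the original line is truncated.
ARGS = (
    "-Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m "
    "-Djava.net.preferIPv4Stack=true -Xms16106127360 -Xmx16106127360 "
    "-XX:+UseG1GC -XX:MaxGCPauseMillis=6000"
)

if __name__ == "__main__":
    print(effective_flags(ARGS))
```

Under that assumption the RegionServer runs with the 16106127360-byte (15 GiB) heap, and the 1000m default set earlier on the command line is overridden.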

How to know the root reason to cause RegionServer OOM?

2015-05-12 Thread David chen
A RegionServer was killed because of OutOfMemory (OOM). Although the killed process can be seen in the Linux message log, I still have the following two questions: 1. How can I find the root cause of the OOM? 2. When the RegionServer encounters an OOM, why can't it free some of the memory it occupies? If so,

Re: How to know the root reason to cause RegionServer OOM?

2015-05-12 Thread Ted Yu
Does the following appear in the command which launched the region server? -XX:OnOutOfMemoryError=kill -9 %p There could be multiple reasons for the region server process to encounter an OOME. Please take a look at HBASE-11544, which fixes a common cause. The fix is in the upcoming 1.1.0 release. Cheers