RE: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)

Gopinathan A Tue, 01 May 2012 22:16:27 -0700

Hi Alex,

You can check the system logs (/var/log/), whether system issued any kill
command for that process. This can happen if the resource utilization is
more from the RS (like memory usage) for very long time.


Are you using GZIP compression? 

Thanks & Regards,
Gopinathan A

****************************************************************************
***********
This e-mail and attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed
above. Any use of the information contained herein in any way (including,
but not limited to, total or partial disclosure, reproduction, or
dissemination) by persons other than the intended recipient's) is
prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!


-----Original Message-----
From: Alex Baranau [mailto:alex.barano...@gmail.com] 
Sent: Tuesday, May 01, 2012 1:18 AM
To: hbase-u...@hadoop.apache.org
Subject: RegionServer silently stops (only "issue": CMS-concurrent-mark
~80sec)

Hello,

During recent weeks I constantly see some RSs *silently* dying on our HBase
cluster. By "silently" I mean that process stops, but no errors in logs [1].

The only thing I can relate to it is long CMS-concurrent-mark: almost 80
seconds. But this should not cause issues as it is not a "stop-the-world"
process.

Any advice?

HBase: hbase-0.90.4-cdh3u3
Hadoop: 0.20.2-cdh3u3

Thank you,
Alex Baranau

[1]

last lines from RS log (no errors before too, and nothing written in *.out
file):

2012-04-30 18:52:11,806 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for agg-sa-1.3,0011|
te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00
\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
because regionserver60020.cacheFlusher; priority=-1, compaction queue size=0
2012-04-30 18:54:58,779 DEBUG
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new
createWriter -- HADOOP-6840
2012-04-30 18:54:58,779 DEBUG
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,133570661339
7/xxx.ec2.internal%3A60020.1335812098651,
syncFs=true, hflush=false
2012-04-30 18:54:58,874 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Roll
/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1
335811856672,
entries=73789, filesize=63773934. New hlog
/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1
335812098651
2012-04-30 18:56:31,867 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up
with memory above low water.
2012-04-30 18:56:31,867 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
agg-sa-1.3,s_00I4|
tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562
\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00
\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
due to global heap pressure
2012-04-30 18:56:31,867 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Started memstore flush for agg-sa-1.3,s_00I4|
tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562
\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00
\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
,
current region memstore size 138.1m
2012-04-30 18:56:31,867 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Finished snapshotting, commencing flushing stores
2012-04-30 18:56:56,303 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=322.84 MB,
free=476.34 MB, max=799.17 MB, blocks=5024, accesses=12189396, hits=127592,
hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949,
cachingHitsRatio=95.82%%, evictions=0, evicted=0, evictedPerRun=NaN
2012-04-30 18:56:59,026 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.t
mp/391890051647401997
to
hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/
1139737908876846168
2012-04-30 18:56:59,034 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/
1139737908876846168,
entries=476418, sequenceid=880198761, memsize=138.1m, filesize=5.7m
2012-04-30 18:56:59,097 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Finished memstore flush of ~138.1m for region agg-sa-1.3,s_00I4|
tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562
\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00
\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
in 27230ms, sequenceid=880198761, compaction requested=false
~

[2]

last lines from GC log:

2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791: [ParNew:
35638K->1118K(38336K), 0.0548970 secs] 3145651K->3111412K(4091776K)
icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00, real=0.09 secs]
2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069: [ParNew:
35230K->2224K(38336K), 0.0802440 secs] 3145524K->3112533K(4091776K)
icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-mark:
8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23 secs]
2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-preclean-start]
2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291: [ParNew:
36336K->3895K(38336K), 0.0272610 secs] 3146645K->3114281K(4091776K)
icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560: [ParNew:
37990K->3082K(38336K), 0.0504660 secs] 3148376K->3114329K(4091776K)
icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00, real=0.05 secs]
2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931: [ParNew:
37194K->3340K(38336K), 0.0419660 secs] 3148441K->3115727K(4091776K)
icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00, real=0.04 secs]
2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207: [ParNew:
37351K->3249K(38336K), 0.0695950 secs] 3149738K->3115636K(4091776K)
icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555: [ParNew:
37361K->2512K(38336K), 0.0510560 secs] 3149748K->3116166K(4091776K)
icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902: [ParNew:
36624K->2616K(38336K), 0.1986770 secs] 3150278K->3116270K(4091776K)
icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00, real=0.20 secs]
2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722: [ParNew:
36728K->3450K(38336K), 0.1099190 secs] 3150382K->3117103K(4091776K)
icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00, real=0.11 secs]
2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739: [ParNew:
37546K->3176K(38336K), 0.2096280 secs] 3151199K->3117154K(4091776K)
icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00, real=0.21 secs]
2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408: [ParNew:
37288K->3928K(38336K), 0.0873820 secs] 3151266K->3117907K(4091776K)
icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217: [ParNew:
38040K->2095K(38336K), 0.0791560 secs] 3152019K->3116839K(4091776K)
icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240: [ParNew:
36207K->2581K(38336K), 0.1651210 secs] 3150951K->3117700K(4091776K)
icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00, real=0.16 secs]
2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510: [ParNew:
36693K->2297K(38336K), 0.1262480 secs] 3151812K->3117742K(4091776K)
icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211: [ParNew:
36409K->4224K(38336K), 0.3658760 secs] 3151854K->3120205K(4091776K)
icms_dc=6 , 0.3660220 secs]

RE: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)

Reply via email to