Hi Alex, You can check the system logs (/var/log/), whether system issued any kill command for that process. This can happen if the resource utilization is more from the RS (like memory usage) for very long time.
Are you using GZIP compression? Thanks & Regards, Gopinathan A **************************************************************************** *********** This e-mail and attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient's) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! -----Original Message----- From: Alex Baranau [mailto:alex.barano...@gmail.com] Sent: Tuesday, May 01, 2012 1:18 AM To: hbase-u...@hadoop.apache.org Subject: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec) Hello, During recent weeks I constantly see some RSs *silently* dying on our HBase cluster. By "silently" I mean that process stops, but no errors in logs [1]. The only thing I can relate to it is long CMS-concurrent-mark: almost 80 seconds. But this should not cause issues as it is not a "stop-the-world" process. Any advice? HBase: hbase-0.90.4-cdh3u3 Hadoop: 0.20.2-cdh3u3 Thank you, Alex Baranau [1] last lines from RS log (no errors before too, and nothing written in *.out file): 2012-04-30 18:52:11,806 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for agg-sa-1.3,0011| te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00 \x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65. because regionserver60020.cacheFlusher; priority=-1, compaction queue size=0 2012-04-30 18:54:58,779 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new createWriter -- HADOOP-6840 2012-04-30 18:54:58,779 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,133570661339 7/xxx.ec2.internal%3A60020.1335812098651, syncFs=true, hflush=false 2012-04-30 18:54:58,874 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1 335811856672, entries=73789, filesize=63773934. New hlog /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1 335812098651 2012-04-30 18:56:31,867 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water. 2012-04-30 18:56:31,867 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region agg-sa-1.3,s_00I4| tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562 \x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00 \x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805. due to global heap pressure 2012-04-30 18:56:31,867 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for agg-sa-1.3,s_00I4| tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562 \x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00 \x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805. , current region memstore size 138.1m 2012-04-30 18:56:31,867 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores 2012-04-30 18:56:56,303 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=322.84 MB, free=476.34 MB, max=799.17 MB, blocks=5024, accesses=12189396, hits=127592, hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949, cachingHitsRatio=95.82%%, evictions=0, evicted=0, evictedPerRun=NaN 2012-04-30 18:56:59,026 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.t mp/391890051647401997 to hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/ 1139737908876846168 2012-04-30 18:56:59,034 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/ 1139737908876846168, entries=476418, sequenceid=880198761, memsize=138.1m, filesize=5.7m 2012-04-30 18:56:59,097 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~138.1m for region agg-sa-1.3,s_00I4| tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562 \x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00 \x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805. in 27230ms, sequenceid=880198761, compaction requested=false ~ [2] last lines from GC log: 2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791: [ParNew: 35638K->1118K(38336K), 0.0548970 secs] 3145651K->3111412K(4091776K) icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00, real=0.09 secs] 2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069: [ParNew: 35230K->2224K(38336K), 0.0802440 secs] 3145524K->3112533K(4091776K) icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00, real=0.13 secs] 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-mark: 8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23 secs] 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-preclean-start] 2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291: [ParNew: 36336K->3895K(38336K), 0.0272610 secs] 3146645K->3114281K(4091776K) icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00, real=0.03 secs] 2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560: [ParNew: 37990K->3082K(38336K), 0.0504660 secs] 3148376K->3114329K(4091776K) icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00, real=0.05 secs] 2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931: [ParNew: 37194K->3340K(38336K), 0.0419660 secs] 3148441K->3115727K(4091776K) icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207: [ParNew: 37351K->3249K(38336K), 0.0695950 secs] 3149738K->3115636K(4091776K) icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00, real=0.07 secs] 2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555: [ParNew: 37361K->2512K(38336K), 0.0510560 secs] 3149748K->3116166K(4091776K) icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00, real=0.00 secs] 2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902: [ParNew: 36624K->2616K(38336K), 0.1986770 secs] 3150278K->3116270K(4091776K) icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00, real=0.20 secs] 2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722: [ParNew: 36728K->3450K(38336K), 0.1099190 secs] 3150382K->3117103K(4091776K) icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00, real=0.11 secs] 2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739: [ParNew: 37546K->3176K(38336K), 0.2096280 secs] 3151199K->3117154K(4091776K) icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00, real=0.21 secs] 2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408: [ParNew: 37288K->3928K(38336K), 0.0873820 secs] 3151266K->3117907K(4091776K) icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00, real=0.08 secs] 2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217: [ParNew: 38040K->2095K(38336K), 0.0791560 secs] 3152019K->3116839K(4091776K) icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00, real=0.08 secs] 2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240: [ParNew: 36207K->2581K(38336K), 0.1651210 secs] 3150951K->3117700K(4091776K) icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00, real=0.16 secs] 2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510: [ParNew: 36693K->2297K(38336K), 0.1262480 secs] 3151812K->3117742K(4091776K) icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00, real=0.13 secs] 2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211: [ParNew: 36409K->4224K(38336K), 0.3658760 secs] 3151854K->3120205K(4091776K) icms_dc=6 , 0.3660220 secs]