Not sure if its related (or even helpful) but we were using cdh3b4 (which is 
0.90.1) and we saw similar issues with region servers going down.. we didn't 
look at GC logs but we had very high zookeeper leases so its unlikely that the 
GC could have caused the issue.. this problem went away when we upgraded to 
cdh3u3 which is rock steady in terms of region servers.. (havent had a single 
region server crash in a month where on the older version I used to have 1 
crash every couple of days).. the only other difference between the two is that 
we use snappy on the newer one and gz on the old

We also noticed that having replication enabled also contributed to the 
issues.. 


------------------------------
On Tue 1 May, 2012 3:15 PM IST N Keywal wrote:

>Hi Alex,
>
>On the same idea, note that hbase is launched with
>-XX:OnOutOfMemoryError="kill -9 %p".
>
>N.
>
>On Tue, May 1, 2012 at 10:41 AM, Igal Shilman <ig...@wix.com> wrote:
>
>> Hi Alex, just to rule out, oom killer,
>> Try this:
>>
>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>
>>
>> On Mon, Apr 30, 2012 at 10:48 PM, Alex Baranau <alex.barano...@gmail.com
>> >wrote:
>>
>> > Hello,
>> >
>> > During recent weeks I constantly see some RSs *silently* dying on our
>> HBase
>> > cluster. By "silently" I mean that process stops, but no errors in logs
>> > [1].
>> >
>> > The only thing I can relate to it is long CMS-concurrent-mark: almost 80
>> > seconds. But this should not cause issues as it is not a "stop-the-world"
>> > process.
>> >
>> > Any advice?
>> >
>> > HBase: hbase-0.90.4-cdh3u3
>> > Hadoop: 0.20.2-cdh3u3
>> >
>> > Thank you,
>> > Alex Baranau
>> >
>> > [1]
>> >
>> > last lines from RS log (no errors before too, and nothing written in
>> *.out
>> > file):
>> >
>> > 2012-04-30 18:52:11,806 DEBUG
>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> > requested for agg-sa-1.3,0011|
>> >
>> >
>> te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
>> > because regionserver60020.cacheFlusher; priority=-1, compaction queue
>> > size=0
>> > 2012-04-30 18:54:58,779 DEBUG
>> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new
>> > createWriter -- HADOOP-6840
>> > 2012-04-30 18:54:58,779 DEBUG
>> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
>> >
>> >
>> Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651,
>> > syncFs=true, hflush=false
>> > 2012-04-30 18:54:58,874 INFO
>> org.apache.hadoop.hbase.regionserver.wal.HLog:
>> > Roll
>> >
>> >
>> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335811856672,
>> > entries=73789, filesize=63773934. New hlog
>> >
>> >
>> /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651
>> > 2012-04-30 18:56:31,867 INFO
>> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke
>> up
>> > with memory above low water.
>> > 2012-04-30 18:56:31,867 INFO
>> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
>> > agg-sa-1.3,s_00I4|
>> >
>> >
>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
>> > due to global heap pressure
>> > 2012-04-30 18:56:31,867 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> > Started memstore flush for agg-sa-1.3,s_00I4|
>> >
>> >
>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.,
>> > current region memstore size 138.1m
>> > 2012-04-30 18:56:31,867 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> > Finished snapshotting, commencing flushing stores
>> > 2012-04-30 18:56:56,303 DEBUG
>> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=322.84
>> MB,
>> > free=476.34 MB, max=799.17 MB, blocks=5024, accesses=12189396,
>> hits=127592,
>> > hitRatio=1.04%%, cachingAccesses=132480, cachingHits=126949,
>> > cachingHitsRatio=95.82%%, evictions=0, evicted=0, evictedPerRun=NaN
>> > 2012-04-30 18:56:59,026 INFO org.apache.hadoop.hbase.regionserver.Store:
>> > Renaming flushed file at
>> >
>> >
>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/.tmp/391890051647401997
>> > to
>> >
>> >
>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168
>> > 2012-04-30 18:56:59,034 INFO org.apache.hadoop.hbase.regionserver.Store:
>> > Added
>> >
>> >
>> hdfs://zzz.ec2.internal/hbase/agg-sa-1.3/30b127193485342359eadf1586819805/a/1139737908876846168,
>> > entries=476418, sequenceid=880198761, memsize=138.1m, filesize=5.7m
>> > 2012-04-30 18:56:59,097 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> > Finished memstore flush of ~138.1m for region agg-sa-1.3,s_00I4|
>> >
>> >
>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
>> > in 27230ms, sequenceid=880198761, compaction requested=false
>> > ~
>> >
>> > [2]
>> >
>> > last lines from GC log:
>> >
>> > 2012-04-30T18:58:46.683+0000: 105717.791: [GC 105717.791: [ParNew:
>> > 35638K->1118K(38336K), 0.0548970 secs] 3145651K->3111412K(4091776K)
>> > icms_dc=6 , 0.0550360 secs] [Times: user=0.08 sys=0.00, real=0.09 secs]
>> > 2012-04-30T18:58:46.961+0000: 105718.069: [GC 105718.069: [ParNew:
>> > 35230K->2224K(38336K), 0.0802440 secs] 3145524K->3112533K(4091776K)
>> > icms_dc=6 , 0.0803810 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
>> > 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-mark:
>> > 8.770/80.230 secs] [Times: user=61.34 sys=5.69, real=80.23 secs]
>> > 2012-04-30T18:58:47.114+0000: 105718.222: [CMS-concurrent-preclean-start]
>> > 2012-04-30T18:58:47.183+0000: 105718.291: [GC 105718.291: [ParNew:
>> > 36336K->3895K(38336K), 0.0272610 secs] 3146645K->3114281K(4091776K)
>> > icms_dc=6 , 0.0274040 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
>> > 2012-04-30T18:58:47.452+0000: 105718.560: [GC 105718.560: [ParNew:
>> > 37990K->3082K(38336K), 0.0504660 secs] 3148376K->3114329K(4091776K)
>> > icms_dc=6 , 0.0506020 secs] [Times: user=0.06 sys=0.00, real=0.05 secs]
>> > 2012-04-30T18:58:47.823+0000: 105718.930: [GC 105718.931: [ParNew:
>> > 37194K->3340K(38336K), 0.0419660 secs] 3148441K->3115727K(4091776K)
>> > icms_dc=6 , 0.0421110 secs] [Times: user=0.05 sys=0.00, real=0.04 secs]
>> > 2012-04-30T18:58:48.099+0000: 105719.207: [GC 105719.207: [ParNew:
>> > 37351K->3249K(38336K), 0.0695950 secs] 3149738K->3115636K(4091776K)
>> > icms_dc=6 , 0.0697320 secs] [Times: user=0.07 sys=0.00, real=0.07 secs]
>> > 2012-04-30T18:58:48.447+0000: 105719.555: [GC 105719.555: [ParNew:
>> > 37361K->2512K(38336K), 0.0510560 secs] 3149748K->3116166K(4091776K)
>> > icms_dc=6 , 0.0512050 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]
>> > 2012-04-30T18:58:48.794+0000: 105719.902: [GC 105719.902: [ParNew:
>> > 36624K->2616K(38336K), 0.1986770 secs] 3150278K->3116270K(4091776K)
>> > icms_dc=6 , 0.1988160 secs] [Times: user=0.05 sys=0.00, real=0.20 secs]
>> > 2012-04-30T18:58:49.614+0000: 105720.722: [GC 105720.722: [ParNew:
>> > 36728K->3450K(38336K), 0.1099190 secs] 3150382K->3117103K(4091776K)
>> > icms_dc=6 , 0.1100650 secs] [Times: user=0.05 sys=0.00, real=0.11 secs]
>> > 2012-04-30T18:58:50.631+0000: 105721.739: [GC 105721.739: [ParNew:
>> > 37546K->3176K(38336K), 0.2096280 secs] 3151199K->3117154K(4091776K)
>> > icms_dc=6 , 0.2097770 secs] [Times: user=0.10 sys=0.00, real=0.21 secs]
>> > 2012-04-30T18:58:51.300+0000: 105722.408: [GC 105722.408: [ParNew:
>> > 37288K->3928K(38336K), 0.0873820 secs] 3151266K->3117907K(4091776K)
>> > icms_dc=6 , 0.0875310 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
>> > 2012-04-30T18:58:52.109+0000: 105723.217: [GC 105723.217: [ParNew:
>> > 38040K->2095K(38336K), 0.0791560 secs] 3152019K->3116839K(4091776K)
>> > icms_dc=6 , 0.0792970 secs] [Times: user=0.05 sys=0.00, real=0.08 secs]
>> > 2012-04-30T18:58:53.132+0000: 105724.240: [GC 105724.240: [ParNew:
>> > 36207K->2581K(38336K), 0.1651210 secs] 3150951K->3117700K(4091776K)
>> > icms_dc=6 , 0.1652650 secs] [Times: user=0.06 sys=0.00, real=0.16 secs]
>> > 2012-04-30T18:58:55.402+0000: 105726.510: [GC 105726.510: [ParNew:
>> > 36693K->2297K(38336K), 0.1262480 secs] 3151812K->3117742K(4091776K)
>> > icms_dc=6 , 0.1264420 secs] [Times: user=0.06 sys=0.00, real=0.13 secs]
>> > 2012-04-30T18:58:56.103+0000: 105727.211: [GC 105727.211: [ParNew:
>> > 36409K->4224K(38336K), 0.3658760 secs] 3151854K->3120205K(4091776K)
>> > icms_dc=6 , 0.3660220 secs]
>> >
>>

Reply via email to