Thanks, Todd. I will try it out.
On Feb 3, 2011, at 1:43 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Charan,
>
> Your GC settings are way off - a 6m newsize will promote far too much to
> the old gen.
>
> Try this:
>
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Xmn256m
> -XX:CMSInitiatingOccupancyFraction=70
>
> -Todd
>
> On Thu, Feb 3, 2011 at 12:28 PM, charan kumar <charan.ku...@gmail.com> wrote:
>
>> Hi Jonathan,
>>
>> Thanks for your quick reply.
>>
>> Heap is set to 4G.
>>
>> The following are the JVM opts:
>>
>> export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError
>> -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:NewSize=6m
>> -XX:MaxNewSize=6m"
>>
>> Are there any other options apart from increasing the RAM?
>>
>> Some more info about the app:
>>
>> - We are storing web page data in HBase.
>> - The row key is the hashed URL, for random distribution, since we don't
>>   plan to do scans.
>> - We have LZO compression set on this column family.
>> - We were seeing 1500 reads/sec when reading the page content.
>> - We have a column family that stores just the metadata of the page
>>   ("title", etc.). When reading this, the performance is a whopping
>>   12000 TPS.
>>
>> We thought the issue could be the network bandwidth used between HBase
>> and the clients, so we disabled LZO compression on the column family and
>> started compressing the raw page on the client and decompressing it when
>> reading (LZO).
>>
>> - With this, my write performance jumped from 2000 to 5000 at peak.
>> - With this approach, the servers are crashing. Not sure why this happens
>>   only after turning off LZO and doing the same from the client.
>>
>> On Thu, Feb 3, 2011 at 12:13 PM, Jonathan Gray <jg...@fb.com> wrote:
>>
>>> How much heap are you running on your RegionServers?
>>>
>>> 6GB of total RAM is on the low end. For high-throughput applications, I
>>> would recommend at least 6-8GB of heap (so 8+ GB of RAM).
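[For reference, settings like those Todd suggests above would go through HBASE_OPTS in conf/hbase-env.sh. A minimal sketch; the -Xmx4g value simply mirrors the 4G heap mentioned in the thread, and note that -Xmn256m takes the place of the -XX:NewSize=6m/-XX:MaxNewSize=6m pair, while -XX:+CMSIncrementalMode is absent from Todd's line:]

```shell
# conf/hbase-env.sh -- sketch of the recommended GC settings.
# -Xmx4g matches the 4G heap mentioned in the thread; size it to your RAM.
# -Xmn256m replaces the tiny 6m new generation that caused promotion failures.
export HBASE_OPTS="$HBASE_OPTS \
  -Xmx4g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -Xmn256m \
  -XX:CMSInitiatingOccupancyFraction=70"
```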
>>>
>>>> -----Original Message-----
>>>> From: charan kumar [mailto:charan.ku...@gmail.com]
>>>> Sent: Thursday, February 03, 2011 11:47 AM
>>>> To: user@hbase.apache.org
>>>> Subject: Region Servers Crashing during Random Reads
>>>>
>>>> Hello,
>>>>
>>>> I am using hbase 0.90.0 with hadoop-append, on Dell 1950 hardware
>>>> (2 CPU, 6 GB RAM).
>>>>
>>>> I had 9 region servers (out of 30) crash in a span of 30 minutes during
>>>> heavy reads. It looks like a GC / ZooKeeper connection timeout issue to
>>>> me. I did all the recommended configuration from the HBase wiki. Any
>>>> other suggestions?
>>>>
>>>> 2011-02-03T09:43:07.890-0800: 70693.632: [GC 70693.632: [ParNew
>>>> (promotion failed): 5555K->5540K(5568K), 0.0280950 secs]70693.660:
>>>> [CMS2011-02-03T09:43:16.864-0800: 70702.606: [CMS-concurrent-mark:
>>>> 12.549/69.323 secs] [Times: user=11.90 sys=1.26, real=69.31 secs]
>>>>
>>>> 2011-02-03T09:53:35.165-0800: 71320.785: [GC 71320.785: [ParNew
>>>> (promotion failed): 5568K->5568K(5568K), 0.4384530 secs]71321.224:
>>>> [CMS2011-02-03T09:53:45.111-0800: 71330.731: [CMS-concurrent-mark:
>>>> 17.511/51.564 secs] [Times: user=38.72 sys=5.67, real=51.60 secs]
>>>>
>>>> The following are the log entries on the region server:
>>>>
>>>> 2011-02-03 10:37:43,946 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session timed out, have not heard from server in 47172ms for sessionid
>>>> 0x12db9f722421ce3, closing socket connection and attempting reconnect
>>>> 2011-02-03 10:37:43,947 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session timed out, have not heard from server in 48159ms for sessionid
>>>> 0x22db9f722501d93, closing socket connection and attempting reconnect
>>>> 2011-02-03 10:37:44,401 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server XXXXXXXXXXXXXXXX
>>>> 2011-02-03 10:37:44,402 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to XXXXXXXXX, initiating session
>>>> 2011-02-03 10:37:44,709 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server XXXXXXXXXXXXXXX
>>>> 2011-02-03 10:37:44,709 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to XXXXXXXXXXXXXXXXXXXXX, initiating session
>>>> 2011-02-03 10:37:44,767 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>>>> started; Attempting to free 81.93 MB of total=696.25 MB
>>>> 2011-02-03 10:37:44,784 DEBUG
>>>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
>>>> completed; freed=81.94 MB, total=614.81 MB, single=379.98 MB,
>>>> multi=309.77 MB, memory=0 KB
>>>> 2011-02-03 10:37:45,205 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> reconnect to ZooKeeper service, session 0x22db9f722501d93 has expired,
>>>> closing socket connection
>>>> 2011-02-03 10:37:45,206 INFO
>>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>>> This client just lost it's session with ZooKeeper, trying to reconnect.
>>>> 2011-02-03 10:37:45,453 INFO
>>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>>> Trying to reconnect to zookeeper
>>>> 2011-02-03 10:37:45,206 INFO org.apache.zookeeper.ClientCnxn: Unable to
>>>> reconnect to ZooKeeper service, session 0x12db9f722421ce3 has expired,
>>>> closing socket connection
>>>> regionserver:60020-0x22db9f722501d93 regionserver:60020-0x22db9f722501d93
>>>> received expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>>>> handled exception: org.apache.hadoop.hbase.YouAreDeadException: Server
>>>> REPORT rejected; currently processing XXXXXXXXXXXX,60020,1296684296172
>>>> as dead server
>>>> org.apache.hadoop.hbase.YouAreDeadException:
>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>>> currently processing XXXXXXXXXXXX,60020,1296684296172 as dead server
>>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
>>>>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
>>>>     at java.lang.Thread.run(Thread.java:619)
>>>>
>>>> Thanks,
>>>> Charan
>
> --
> Todd Lipcon
> Software Engineer, Cloudera