Hi Kaveh,

  How large is the heap that you are using?  Also, what GC settings do you
have in place?  Your main issue looks to be here:

2013-04-22 16:47:21,843 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
ABORTING region server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443,
load=(requests=5392, regions=196, usedHeap=1063, maxHeap=3966):
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from
ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
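One detail worth checking alongside GC: in the reconnect log further down the thread, the client asks for sessionTimeout=180000 but the server answers "negotiated timeout = 40000". ZooKeeper caps granted sessions at its server-side maxSessionTimeout (20 x tickTime, i.e. 40s with defaults), so the effective session is much shorter than what was configured. A sketch of the HBase-side setting involved (values are illustrative):

```xml
<!-- hbase-site.xml: the session length the HBase client requests -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>
```

For the longer request to actually take effect, the ZooKeeper servers also need `maxSessionTimeout=180000` (or higher) in zoo.cfg; otherwise the negotiated timeout stays capped at 40s. Note a longer session only masks pauses, so it complements, rather than replaces, fixing the GC behavior.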

The interesting part comes afterwards:

2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many consecutive RollWriter requests, it's a sign of the total number
of live datanodes is lower than the tolerable replicas.

  Are you also seeing your Datanodes drop off the network or become dead
nodes?  My thought is this could be networking issues, oversubscribed
nodes, or GC issues.
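On the GC-log question in the quoted message below: the RegionServer JVM picks up extra flags from HBASE_OPTS in conf/hbase-env.sh, so GC logging can be enabled there. A minimal sketch, assuming a writable log path (the path is illustrative):

```shell
# Illustrative fragment for conf/hbase-env.sh: enable GC logging so
# long pauses show up in a file you can correlate with the aborts.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
echo "$HBASE_OPTS"
```

After restarting the RegionServer, pauses in that log longer than the negotiated ZooKeeper timeout would confirm the GC theory; `hadoop dfsadmin -report` is a quick way to check the live/dead DataNode counts for the replication warning above.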



On Tue, Apr 23, 2013 at 12:47 AM, kaveh minooie <ka...@plutoz.com> wrote:

> thanks everyone for responding.
>
> No, I don't have the GC logs; I don't even know how I can get them. But it
> seems that the regionserver did recover from that and then gets into
> trouble here:
>
>
> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> compaction interrupted by user:
> java.io.InterruptedIOException: Aborting compaction of store f in region
> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
>
> The part that I don't understand is what it means when it says "compaction
> interrupted by user"!
>
> And to answer your question, Ted: I am using 0.90.6 over Hadoop 1.1.1 (I
> can't upgrade since Gora so far only works with 0.90.x), and no, everything
> was normal as far as I could tell. The map jobs were staggering since, I
> assume, HBase became unresponsive (the web interface started showing
> exceptions, and that is how I figured out that that regionserver was down).
> While I was restarting this one (through the status command in the shell) I
> noticed that two more regionservers went down (with an identical error, the
> second one, not the one about the GC pause), but once I restarted the
> regionservers (using hbase-daemon.sh) everything went back to normal. But
> this keeps happening, and as a result I can't leave my jobs unsupervised.
>
> thanks,
>
>
> On 04/22/2013 07:35 PM, Ted Yu wrote:
>
>> Kaveh:
>> What version of HBase are you using ?
>> Around 2013-04-22 16:47:56, did you observe anything else happening in
>> your
>> cluster ? See below:
>>
>> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> compaction interrupted by user:
>> java.io.InterruptedIOException: Aborting compaction of store f in region
>> t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
>>          at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>          at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>          at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>
>> On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
>> jean-m...@spaggiari.org> wrote:
>>
>>  Hi Kaveh,
>>>
>>> the response is maybe already displayed in the logs you sent ;)
>>>
>>> "This disconnect could have been caused by a network partition or a
>>> long-running GC pause, either way it's recommended that you verify
>>> your environment."
>>>
>>> Do you have GC logs? Have you tried anything to solve that?
>>>
>>> JM
>>>
>>> 2013/4/22 kaveh minooie <ka...@plutoz.com>:
>>>
>>>> Hi
>>>>
>>>> After a few MapReduce jobs my regionservers shut themselves down. This is
>>>> the latest time that this has happened:
>>>>
>>>> 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, trying to reconnect.
>>>> 2013-04-22 16:47:21,843 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5392, regions=196, usedHeap=1063, maxHeap=3966): regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
>>>> 2013-04-22 16:47:21,843 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Trying to reconnect to zookeeper.
>>>> 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=1794, regions=196, stores=1561, storefiles=1585, storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10, flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032, blockCacheFree=169901776, blockCacheCount=7242, blockCacheHitCount=910925, blockCacheMissCount=1558134, blockCacheEvictedCount=1344753, blockCacheHitRatio=36, blockCacheHitCachingRatio=40
>>>> 2013-04-22 16:47:21,844 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired from ZooKeeper, aborting
>>>> 2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>>> 2013-04-22 16:47:21,900 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.
>>>> 2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=zk1:2181 sessionTimeout=180000 watcher=hconnection
>>>> 2013-04-22 16:47:22,357 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to close
>>>> 2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not attempt to authenticate using SASL (unknown error)
>>>> 2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating session
>>>> 2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181, sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
>>>> 2013-04-22 16:47:22,400 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment.
>>>> 2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>>> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by user:
>>>> java.io.InterruptedIOException: Aborting compaction of store f in region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.
>>>>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>>>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>>>>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>>> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: aborted compaction on region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. after 5mins, 58sec
>>>> 2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: regionserver60020.compactor exiting
>>>> 2013-04-22 16:47:56,832 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
>>>> 2013-04-22 16:47:57,363 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: regionserver60020.logSyncer exiting
>>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing leases
>>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed leases
>>>> 2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>>> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
>>>> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
>>>> 2013-04-22 16:47:57,497 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closing leases
>>>> 2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases: regionserver60020.leaseChecker closed leases
>>>> 2013-04-22 16:47:57,598 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>>>
>>>> I would appreciate it very much if someone could explain to me what just
>>>> happened here.
>>>>
>>>> thanks,
>>>>
>>>
>


-- 
Kevin O'Dell
Systems Engineer, Cloudera
