I use HFileOutputFormat.configureIncrementalLoad() to run a MR job,
which lasts about 10 minutes.

On Wed, Jun 29, 2011 at 11:53 AM, Stack <[email protected]> wrote:
> I'm just guessing the MR job is vampiring i/o from the zk processes.
> Can you check?
> St.Ack
>
> On Tue, Jun 28, 2011 at 8:29 PM, Gan, Xiyun <[email protected]> wrote:
>> Thanks Stack.
>>
>> Yeah, I have M/R job on Zookeeper nodes. Does it necessary to separate
>> zk nodes and tasktrackers?
>>
>> On Wed, Jun 29, 2011 at 11:24 AM, Stack <[email protected]> wrote:
>>> Yes.  Your master is timing out against your zk. Whats going on?  ZK
>>> nodes have MR running on them?  Your master is doing long GC pauses?
>>> St.Ack
>>>
>>> On Tue, Jun 28, 2011 at 7:46 PM, Gan, Xiyun <[email protected]> wrote:
>>>> When using HFileOutputFormat, HMaster often crashes.
>>>>
>>>> I guess this critical problem is caused by the timeout of Zookeeper.
>>>>
>>>> Does anyone know what may have caused this?   How might I prevent this
>>>> from happening again?
>>>>
>>>> HBase Version : 0.90.3, r1100350
>>>> Hadoop Version : 0.20.3-SNAPSHOT, r1057313
>>>>
>>>> Some snippets from the logs are pasted below
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> 2011-06-29 02:26:40,454 INFO org.apache.zookeeper.ClientCnxn: Client
>>>> session timed out, have not heard from server in 200000ms for
>>>> sessionid 0x130d6fba3cc0001, closing socket connection and attempting
>>>> reconnect
>>>> 2011-06-29 02:26:40,557 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Unable to get data of znode
>>>> /hbase/unassigned/3624b719752997d87cbba6529edcd2a7
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/3624b719752997d87cbba6529edcd2a7
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:770)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1709)
>>>>  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>>> 2011-06-29 02:26:40,557 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> Unexpected ZK exception creating/setting node OFFLINE
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/f3523f7dd3eeaae3bb3c66907227e8b0
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:856)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.createOrForceNodeOffline(ZKAssign.java:246)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:967)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:918)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:154)
>>>>  at 
>>>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>  at java.lang.Thread.run(Thread.java:662)
>>>> 2011-06-29 02:26:40,557 INFO org.apache.hadoop.hbase.master.HMaster: 
>>>> Aborting
>>>> 2011-06-29 02:26:40,557 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Received unexpected KeeperException, re-throwing exception
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/3624b719752997d87cbba6529edcd2a7
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:770)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1709)
>>>>  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>>> 2011-06-29 02:26:40,558 ERROR
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Unexpected ZK
>>>> exception timing out CLOSING region
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/3624b719752997d87cbba6529edcd2a7
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:586)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.getDataNoWatch(ZKAssign.java:770)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1709)
>>>>  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>>> 2011-06-29 02:26:40,558 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions in
>>>> transition timed out:
>>>> BaseLog,C110,1309156850234.bfcc3150f5b6ebd169de3266ce49e764.
>>>> state=OPENING, ts=1309285400162
>>>> 2011-06-29 02:26:40,558 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>>>> OPENING for too long, reassigning
>>>> region=BaseLog,C110,1309156850234.bfcc3150f5b6ebd169de3266ce49e764.
>>>> 2011-06-29 02:26:41,063 DEBUG org.apache.hadoop.hbase.master.HMaster:
>>>> Stopping service threads
>>>> 2011-06-29 02:26:41,064 INFO org.apache.hadoop.ipc.HBaseServer:
>>>> Stopping server on 60000
>>>> 2011-06-29 02:26:41,064 INFO org.apache.hadoop.hbase.master.HMaster$1:
>>>> dev-199-121:60000-BalancerChore exiting
>>>> 2011-06-29 02:26:41,064 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 3 on 60000: exiting
>>>> 2011-06-29 02:26:41,064 INFO
>>>> org.apache.hadoop.hbase.master.CatalogJanitor:
>>>> dev-199-121:60000-CatalogJanitor exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 7 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 6 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer:
>>>> Stopping IPC Server listener on 60000
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 9 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 1 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 4 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 0 on 60000: exiting
>>>> 2011-06-29 02:26:41,065 INFO
>>>> org.apache.hadoop.hbase.master.LogCleaner:
>>>> master-dev-199-121:60000.oldLogCleaner exiting
>>>> 2011-06-29 02:26:41,065 INFO org.apache.hadoop.hbase.master.HMaster:
>>>> Stopping infoServer
>>>> 2011-06-29 02:26:41,065 INFO org.apache.hadoop.ipc.HBaseServer:
>>>> Stopping IPC Server Responder
>>>> 2011-06-29 02:26:41,065 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 5 on 60000: exiting
>>>> 2011-06-29 02:26:41,064 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 2 on 60000: exiting
>>>> 2011-06-29 02:26:41,066 INFO org.apache.hadoop.ipc.HBaseServer: IPC
>>>> Server handler 8 on 60000: exiting
>>>> 2011-06-29 02:26:41,073 INFO org.mortbay.log: Stopped
>>>> [email protected]:60010
>>>> 2011-06-29 02:26:41,390 INFO org.apache.zookeeper.ClientCnxn: Opening
>>>> socket connection to server dev-197-149/10.249.197.149:61000
>>>> 2011-06-29 02:26:41,391 INFO org.apache.zookeeper.ClientCnxn: Socket
>>>> connection established to dev-197-149/10.249.197.149:61000, initiating
>>>> session
>>>> 2011-06-29 02:26:41,394 INFO org.apache.zookeeper.ClientCnxn: Session
>>>> establishment complete on server dev-197-149/10.249.197.149:61000,
>>>> sessionid = 0x130d6fba3cc0001, negotiated timeout = 300000
>>>> 2011-06-29 02:27:21,996 DEBUG
>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
>>>> tracker org.apache.hadoop.hbase.catalog.CatalogTracker@7bd33a6b
>>>> 2011-06-29 02:27:21,997 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Unable to get data of znode
>>>> /hbase/unassigned/bfcc3150f5b6ebd169de3266ce49e764
>>>> java.lang.InterruptedException
>>>>  at java.lang.Object.wait(Native Method)
>>>>  at java.lang.Object.wait(Object.java:485)
>>>>  at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1317)
>>>>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:919)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1734)
>>>>  at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>>> 2011-06-29 02:27:21,997 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Successfully
>>>> transitioned 
>>>> region=BaseLog,C110,1309156850234.bfcc3150f5b6ebd169de3266ce49e764.
>>>> into OFFLINE and forcing a new assignment
>>>> 2011-06-29 02:27:21,997 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions in
>>>> transition timed out:
>>>> BaseLog,C2,1309156850235.f3523f7dd3eeaae3bb3c66907227e8b0.
>>>> state=OFFLINE, ts=1309285399926
>>>> 2011-06-29 02:27:21,997 INFO
>>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
>>>> Closed zookeeper sessionid=0x30d6fc37f40006
>>>> 2011-06-29 02:27:21,997 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>>>> OFFLINE for too long, reassigning
>>>> BaseLog,C2,1309156850235.f3523f7dd3eeaae3bb3c66907227e8b0. to a random
>>>> server
>>>> 2011-06-29 02:27:22,001 DEBUG
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling
>>>> transition=RS_ZK_REGION_OPENING,
>>>> server=dev-195-151,60020,1309276923241,
>>>> region=6aec4056cb58be626bc1686cb77b5557, which is more than 15 seconds
>>>> late
>>>> 2011-06-29 02:27:22,001 INFO org.apache.zookeeper.ZooKeeper: Session:
>>>> 0x30d6fc37f40006 closed
>>>> 2011-06-29 02:27:22,002 INFO org.apache.zookeeper.ClientCnxn:
>>>> EventThread shut down
>>>> 2011-06-29 02:27:22,200 INFO
>>>> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
>>>> ephemeral node deleted, processing expiration
>>>> [dev-192-19,60020,1309277114541]
>>>> 2011-06-29 02:27:22,200 DEBUG
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Creating (or updating) unassigned node for
>>>> 4510ea062f3aee7fea9c8c034e6085af with OFFLINE state
>>>> 2011-06-29 02:27:22,200 DEBUG
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
>>>> was=BaseLog,C110,1309156850234.bfcc3150f5b6ebd169de3266ce49e764.
>>>> state=OPENING, ts=1309285400162
>>>> 2011-06-29 02:27:22,201 DEBUG
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Server stopped;
>>>> skipping assign of
>>>> BaseLog,C110,1309156850234.bfcc3150f5b6ebd169de3266ce49e764.
>>>> state=OFFLINE, ts=1309285642201
>>>> 2011-06-29 02:27:22,201 DEBUG
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
>>>> was=BaseLog,C2,1309156850235.f3523f7dd3eeaae3bb3c66907227e8b0.
>>>> state=OFFLINE, ts=1309285399926
>>>> 2011-06-29 02:27:22,201 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Unable to set watcher on znode
>>>> (/hbase/unassigned/4510ea062f3aee7fea9c8c034e6085af)
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/4510ea062f3aee7fea9c8c034e6085af
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
>>>>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:260)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.createOrForceNodeOffline(ZKAssign.java:244)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:967)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:918)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:154)
>>>>  at 
>>>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>  at java.lang.Thread.run(Thread.java:662)
>>>> 2011-06-29 02:27:22,201 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
>>>> master:60000-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001-0x130d6fba3cc0001
>>>> Received unexpected KeeperException, re-throwing exception
>>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>> KeeperErrorCode = ConnectionLoss for
>>>> /hbase/unassigned/4510ea062f3aee7fea9c8c034e6085af
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>  at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
>>>>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:260)
>>>>  at 
>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign.createOrForceNodeOffline(ZKAssign.java:244)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:967)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:918)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
>>>>  at 
>>>> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:154)
>>>>  at 
>>>> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>  at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>
>>>>
>>>> --
>>>> Best wishes
>>>> Gan, Xiyun
>>>>
>>>
>>
>>
>>
>> --
>> Best wishes
>> Gan, Xiyun
>>
>



-- 
Best wishes
Gan, Xiyun

Reply via email to