Hey Alex, It's either the client or the server box that wasn't responding, either way they didn't talk for 52 seconds when your session timeout was set at 40 seconds. I suggest you also take a look at the ZK log, it should tell you exactly when the session was expired.
J-D On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau <alex.barano...@gmail.com> wrote: > Hello guys, > > I got regionserver crash and trying to find out why. I found > * in regionserver log [1] : ZK session expired, and before that slow hlog > edits; > * nothing in DataNode log [2] and HMaster[4] log; > * some warns in ZK log [3] with EndOfStreamException. > > I wonder can it be because of long GC and thus "stop the world" pause for > about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an > old test box). If it can be caused by GC I'd restart the server with GC > logging, but if not, I'd like to figure out the cause... > > HBase version is 0.89.20100924. > > Thank you, > > Alex Baranau > ---- > Sematext :: http://sematext.com/ > > [1] Regionserver log: ZK session expired, before that slow hlog edits > > 2010-12-08 06:43:15,403 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region agg2,system|tq|s\x00|in licensed License > project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c. > 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store: > Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed > License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c., > hasReferences=true, into > hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp, > sequenceid=1153119174 > 2010-12-08 06:43:24,825 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > started; Attempting to free 39.93 MB of total=339.31 MB > 2010-12-08 06:43:24,829 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB, multi=154.82 > MB, memory=33.28 MB > 2010-12-08 06:43:50,871 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > started; Attempting to free 40 MB of total=339.39 MB > 2010-12-08 06:43:50,875 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction > completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB, multi=154.96 > MB, memory=33.28 MB > 2010-12-08 06:43:57,095 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: > IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog; > editcount=8882 > 2010-12-08 06:44:25,086 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: > IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog; > editcount=13054 > 2010-12-08 06:44:55,815 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: > IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog; > editcount=13472 > 2010-12-08 06:45:33,003 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5 MB, > free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227, > hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453, > evictedPerRun=636.77880859375 > 2010-12-08 06:45:45,333 INFO > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of > raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global > memstore limit of 798.4m exceeded; currently 812.8m and flushing till 698.6m > 2010-12-08 06:45:45,363 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Started memstore flush for region > raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region > memstore size 45.0m > 2010-12-08 06:45:45,364 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Finished snapshotting, commencing flushing stores > 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming flushed file at > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831 > to > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580 > 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store: > Added > hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580, > entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to > raw3,,1291780273350.73d076a68191712c68d16c4805973561. > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session > timed out, have not heard from server in 52481ms for sessionid > 0x12c34f1783c0002, closing socket connection and attempting reconnect > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session > timed out, have not heard from server in 52481ms for sessionid > 0x12c34f1783c00bf, closing socket connection and attempting reconnect > 2010-12-08 06:46:44,200 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Disconnected, type: None, path: null > 2010-12-08 06:46:44,200 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Disconnected, type: None, path: null > 2010-12-08 06:46:44,201 WARN > org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer > connected to ZooKeeper, current state: Disconnected > 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181, > initiating session > 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server ip-10-245-202-207.ec2.internal/ > 10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout = > 40000 > 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12c34f1783c00bf closed > 2010-12-08 06:46:51,264 DEBUG > org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: > <localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed > connection with ZooKeeper; /hbase/root-region-server > 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening socket > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181 > 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181, > initiating session > 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to > reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired, > closing socket connection > 2010-12-08 06:46:58,491 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, > state: Expired, type: None, path: null > 2010-12-08 06:46:58,491 ERROR > org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session > expired > 2010-12-08 06:46:58,491 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server > serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235, > load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper > session expired > > > [2] DataNode log: nothing, just noticeable pause for almost 1 minute > > 2010-12-08 06:45:58,120 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_6449721912917715607_913675, duration: 3639142 > 2010-12-08 06:45:58,134 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_6449721912917715607_913675, duration: 427441 > 2010-12-08 06:45:58,148 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_825672696763040280_900371, duration: 6352102264 > 2010-12-08 06:46:44,192 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /127.0.0.1:50010 > 2010-12-08 06:46:51,263 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_8449446729722678098_531164, duration: 401874426400 > 2010-12-08 06:46:51,333 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / > 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ, > cliID: > DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335, > offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787, > blockid: blk_-4815658898725325280_624191, duration: 401614939902 > 2010-12-08 06:46:58,266 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls > recoverBlock(block=blk_7609568532460932683_912614, targets=[127.0.0.1:50010 > ]) > > [3] Zookeeper log: some warns > > 2010-12-08 02:50:46,100 - INFO [ProcessThread:-1:preprequestproces...@385] > - Processed session termination for sessionid: 0x12c34f1783d6eba > 2010-12-08 02:50:46,104 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba > 2010-12-08 06:46:44,092 - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to > read additional data from client sessionid 0x12c34f1783c00bf, likely client > has closed socket > 2010-12-08 06:46:44,661 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf > 2010-12-08 06:46:44,662 - WARN [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to > read additional data from client sessionid 0x12c34f1783c0002, likely client > has closed socket > 2010-12-08 06:46:44,662 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002 > 2010-12-08 06:46:51,219 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket connection > from /10.245.202.207:33388 > 2010-12-08 06:46:51,219 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew session > 0x12c34f1783c00bf at /10.245.202.207:33388 > 2010-12-08 06:46:51,220 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1579] - Established session > 0x12c34f1783c00bf with negotiated timeout 40000 for client / > 10.245.202.207:33388 > 2010-12-08 06:46:51,220 - INFO [ProcessThread:-1:preprequestproces...@385] > - Processed session termination for sessionid: 0x12c34f1783c00bf > 2010-12-08 06:46:51,223 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf > 2010-12-08 06:46:58,001 - INFO [SessionTracker:zookeeperser...@315] - > Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded > 2010-12-08 06:46:58,002 - INFO [ProcessThread:-1:preprequestproces...@385] > - Processed session termination for sessionid: 0x12c34f1783c0002 > 2010-12-08 06:46:58,376 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket connection > from /10.245.202.207:33394 > 2010-12-08 06:46:58,489 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew session > 0x12c34f1783c0002 at /10.245.202.207:33394 > 2010-12-08 06:46:58,489 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1572] - Invalid session 0x12c34f1783c0002 > for client /10.245.202.207:33394, probably expired > 2010-12-08 06:46:58,490 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for > client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002 > > [4] HMaster log: nothing > > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scan of 648 row(s) of meta region {server: > 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>} > complete > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: All > 1 .META. region(s) scanned > 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete > 2010-12-08 06:45:35,367 INFO org.apache.hadoop.hbase.master.ServerManager: 1 > region servers, 0 dead, average load 649.0 > 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.metaScanner scanning meta region {server: 10.245.202.207:60020, > regionname: .META.,,1.1028785192, startKey: <>} > 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020, > regionname: -ROOT-,,0.70236052, startKey: <>} > 2010-12-08 06:46:35,369 INFO org.apache.hadoop.hbase.master.ServerManager: 1 > region servers, 0 dead, average load 649.0 > 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner: > RegionManager.rootScanner scan of 1 row(s) of meta region {server: > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete > 2010-12-08 06:46:58,005 INFO org.apache.hadoop.hbase.master.ServerManager: > ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired > 2010-12-08 06:46:58,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: > Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers, > added shutdown processing operation > 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager: > -ROOT- region unset (but not set to be reassigned) > 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager: > META region whose startkey is removed from onlineMetaRegions > 2010-12-08 06:46:58,006 DEBUG > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: > ProcessServerShutdown of ip-10-245-202-207.ec2.internal,60020,1289379196235 > 2010-12-08 06:46:58,006 INFO > org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of > server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: false, > rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0 > 2010-12-08 06:46:58,012 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: > Splitting 34 hlog(s) in > hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235 >