Hey Alex,

It's either the client or the server box that wasn't responding,
either way they didn't talk for 52 seconds when your session timeout
was set at 40 seconds. I suggest you also take a look at the ZK log,
it should tell you exactly when the session was expired.

J-D

On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
> Hello guys,
>
> I got regionserver crash and trying to find out why. I found
> * in regionserver log [1] : ZK session expired, and before that slow hlog
> edits;
> * nothing in DataNode log [2] and HMaster[4] log;
> * some warns in ZK log [3] with EndOfStreamException.
>
> I wonder can it be because of long GC and thus "stop the world" pause for
> about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an
> old test box). If it can be caused by GC I'd restart the server with GC
> logging, but if not, I'd like to figure out the cause...
>
> HBase version is 0.89.20100924.
>
> Thank you,
>
> Alex Baranau
> ----
> Sematext :: http://sematext.com/
>
> [1] Regionserver log: ZK session expired, before that slow hlog edits
>
> 2010-12-08 06:43:15,403 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region agg2,system|tq|s\x00|in licensed License
> project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.
> 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store:
> Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed
> License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.,
> hasReferences=true, into
> hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp,
> sequenceid=1153119174
> 2010-12-08 06:43:24,825 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 39.93 MB of total=339.31 MB
> 2010-12-08 06:43:24,829 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB, multi=154.82
> MB, memory=33.28 MB
> 2010-12-08 06:43:50,871 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> started; Attempting to free 40 MB of total=339.39 MB
> 2010-12-08 06:43:50,875 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB, multi=154.96
> MB, memory=33.28 MB
> 2010-12-08 06:43:57,095 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog;
> editcount=8882
> 2010-12-08 06:44:25,086 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog;
> editcount=13054
> 2010-12-08 06:44:55,815 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
> IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog;
> editcount=13472
> 2010-12-08 06:45:33,003 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5 MB,
> free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227,
> hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453,
> evictedPerRun=636.77880859375
> 2010-12-08 06:45:45,333 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
> raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global
> memstore limit of 798.4m exceeded; currently 812.8m and flushing till 698.6m
> 2010-12-08 06:45:45,363 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for region
> raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region
> memstore size 45.0m
> 2010-12-08 06:45:45,364 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Finished snapshotting, commencing flushing stores
> 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store:
> Renaming flushed file at
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831
> to
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580
> 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store:
> Added
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580,
> entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to
> raw3,,1291780273350.73d076a68191712c68d16c4805973561.
> 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 52481ms for sessionid
> 0x12c34f1783c0002, closing socket connection and attempting reconnect
> 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 52481ms for sessionid
> 0x12c34f1783c00bf, closing socket connection and attempting reconnect
> 2010-12-08 06:46:44,200 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-12-08 06:46:44,200 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path: null
> 2010-12-08 06:46:44,201 WARN
> org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer
> connected to ZooKeeper, current state: Disconnected
> 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
> initiating session
> 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server ip-10-245-202-207.ec2.internal/
> 10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout =
> 40000
> 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12c34f1783c00bf closed
> 2010-12-08 06:46:51,264 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
> <localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed
> connection with ZooKeeper; /hbase/root-region-server
> 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
> initiating session
> 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired,
> closing socket connection
> 2010-12-08 06:46:58,491 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2010-12-08 06:46:58,491 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> 2010-12-08 06:46:58,491 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server
> serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235,
> load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper
> session expired
>
>
> [2] DataNode log: nothing, just noticeable pause for almost 1 minute
>
> 2010-12-08 06:45:58,120 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_6449721912917715607_913675, duration: 3639142
> 2010-12-08 06:45:58,134 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_6449721912917715607_913675, duration: 427441
> 2010-12-08 06:45:58,148 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_825672696763040280_900371, duration: 6352102264
> 2010-12-08 06:46:44,192 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /127.0.0.1:50010
> 2010-12-08 06:46:51,263 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_8449446729722678098_531164, duration: 401874426400
> 2010-12-08 06:46:51,333 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ,
> cliID:
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> blockid: blk_-4815658898725325280_624191, duration: 401614939902
> 2010-12-08 06:46:58,266 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_7609568532460932683_912614, targets=[127.0.0.1:50010
> ])
>
> [3] Zookeeper log: some warns
>
> 2010-12-08 02:50:46,100 - INFO  [ProcessThread:-1:preprequestproces...@385]
> - Processed session termination for sessionid: 0x12c34f1783d6eba
> 2010-12-08 02:50:46,104 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for
> client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba
> 2010-12-08 06:46:44,092 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x12c34f1783c00bf, likely client
> has closed socket
> 2010-12-08 06:46:44,661 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for
> client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf
> 2010-12-08 06:46:44,662 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@633] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x12c34f1783c0002, likely client
> has closed socket
> 2010-12-08 06:46:44,662 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for
> client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002
> 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket connection
> from /10.245.202.207:33388
> 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew session
> 0x12c34f1783c00bf at /10.245.202.207:33388
> 2010-12-08 06:46:51,220 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1579] - Established session
> 0x12c34f1783c00bf with negotiated timeout 40000 for client /
> 10.245.202.207:33388
> 2010-12-08 06:46:51,220 - INFO  [ProcessThread:-1:preprequestproces...@385]
> - Processed session termination for sessionid: 0x12c34f1783c00bf
> 2010-12-08 06:46:51,223 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for
> client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf
> 2010-12-08 06:46:58,001 - INFO  [SessionTracker:zookeeperser...@315] -
> Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded
> 2010-12-08 06:46:58,002 - INFO  [ProcessThread:-1:preprequestproces...@385]
> - Processed session termination for sessionid: 0x12c34f1783c0002
> 2010-12-08 06:46:58,376 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@250] - Accepted socket connection
> from /10.245.202.207:33394
> 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@769] - Client attempting to renew session
> 0x12c34f1783c0002 at /10.245.202.207:33394
> 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1572] - Invalid session 0x12c34f1783c0002
> for client /10.245.202.207:33394, probably expired
> 2010-12-08 06:46:58,490 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:nioserverc...@1434] - Closed socket connection for
> client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002
>
> [4] HMaster log: nothing
>
> 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scan of 648 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>}
> complete
> 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: All
> 1 .META. region(s) scanned
> 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: -ROOT-,,0.70236052, startKey: <>}
> 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
> 2010-12-08 06:45:35,367 INFO org.apache.hadoop.hbase.master.ServerManager: 1
> region servers, 0 dead, average load 649.0
> 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: .META.,,1.1028785192, startKey: <>}
> 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
> regionname: -ROOT-,,0.70236052, startKey: <>}
> 2010-12-08 06:46:35,369 INFO org.apache.hadoop.hbase.master.ServerManager: 1
> region servers, 0 dead, average load 649.0
> 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
> 2010-12-08 06:46:58,005 INFO org.apache.hadoop.hbase.master.ServerManager:
> ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired
> 2010-12-08 06:46:58,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers,
> added shutdown processing operation
> 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
> META region whose startkey is  removed from onlineMetaRegions
> 2010-12-08 06:46:58,006 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
> ProcessServerShutdown of ip-10-245-202-207.ec2.internal,60020,1289379196235
> 2010-12-08 06:46:58,006 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of
> server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: false,
> rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
> 2010-12-08 06:46:58,012 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> Splitting 34 hlog(s) in
> hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235
>

Reply via email to