[ https://issues.apache.org/jira/browse/HADOOP-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12550767 ]
stack commented on HADOOP-2343:
-------------------------------

Another one of these happened over on Paul's cluster w/ TRUNK from about a day ago. He has it configured to run w/ 40 threads per server, so I'm guessing it's not likely a lack of allocated threads (could be, though):

{code}
...
2007-12-11 15:39:15,597 DEBUG hbase.HLog - Closing current log writer /hbase/log_XX.XX.XX.16_1197365632138_60020/hlog.dat.2047 to get a new one
2007-12-11 15:39:15,611 INFO hbase.HLog - new log writer created at /hbase/log_XX.XX.XX.16_1197365632138_60020/hlog.dat.2048
2007-12-11 15:39:15,611 DEBUG hbase.HLog - Found 3 logs to remove using oldest outstanding seqnum of 106741610 from region postlog,img141/6876/angjol7qx.jpg,1197403515753
2007-12-11 15:39:15,612 INFO hbase.HLog - removing old log file /hbase/log_XX.XX.XX.16_1197365632138_60020/hlog.dat.2044 whose highest sequence/edit id is 106644872
2007-12-11 15:39:15,616 INFO hbase.HLog - removing old log file /hbase/log_XX.XX.XX.16_1197365632138_60020/hlog.dat.2045 whose highest sequence/edit id is 106674877
2007-12-11 15:39:15,621 INFO hbase.HLog - removing old log file /hbase/log_XX.XX.XX.16_1197365632138_60020/hlog.dat.2046 whose highest sequence/edit id is 106731580
2007-12-11 15:53:53,090 DEBUG hbase.HRegion - Started memcache flush for region postlog,img212/6231/yoturco8lb.jpg,1197410959126. Size 96.5k
2007-12-11 15:53:53,407 FATAL hbase.HRegionServer - unable to report to master for 858080 milliseconds - aborting server
2007-12-11 15:53:53,407 INFO hbase.Leases - regionserver/0:0:0:0:0:0:0:0:60020 closing leases
2007-12-11 15:53:53,652 WARN ipc.Server - IPC Server handler 32 on 60020, call batchUpdate(postlog,img211/363/15171222365f2bc22xh.jpg,1197410959123, 1195466232000, [EMAIL PROTECTED]) from 38.99.77.106:35490: output error
java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
	at org.apache.hadoop.ipc.SocketChannelOutputStream.flushBuffer(SocketChannelOutputStream.java:108)
	at org.apache.hadoop.ipc.SocketChannelOutputStream.write(SocketChannelOutputStream.java:89)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
	at java.io.DataOutputStream.flush(DataOutputStream.java:106)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:663)
{code}

> [hbase] Stuck regionserver?
> ---------------------------
>
>          Key: HADOOP-2343
>          URL: https://issues.apache.org/jira/browse/HADOOP-2343
>      Project: Hadoop
>   Issue Type: Bug
>   Components: contrib/hbase
>     Reporter: stack
>     Assignee: stack
>     Priority: Minor
>
> Looking in logs, a regionserver went down because it could not contact the
> master after 60 seconds. Watching logging, the HRS is repeatedly checking
> all 150 loaded regions over and over again w/ a pause of about 5 seconds
> between runs...
> then there is a suspicious 60+ second gap with no logging, as
> though the regionserver had hung up on something:
> {code}
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegionServer - flushing region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635
> 2007-12-03 13:14:54,178 DEBUG hbase.HRegion - Not flushing cache for region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635: snapshotMemcaches() determined that there was nothing to do
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region postlog,img247/230/seanpaul4li.jpg,1196615889965
> 2007-12-03 13:14:54,205 DEBUG hbase.HRegion - Not flushing cache for region postlog,img247/230/seanpaul4li.jpg,1196615889965: snapshotMemcaches() determined that there was nothing to do
> 2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to master for 67467 milliseconds - aborting server
> 2007-12-03 13:16:04,455 INFO hbase.Leases - regionserver/0:0:0:0:0:0:0:0:60020 closing leases
> 2007-12-03 13:16:04,455 INFO hbase.Leases$LeaseMonitor - regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker exiting
> {code}
> Master seems to be running fine scanning its ~700 regions. Then you see this
> in the log, before the HRS shuts itself down:
> {code}
> 2007-12-03 13:14:31,416 INFO hbase.Leases - HMaster.leaseChecker lease expired 153260899/153260899
> 2007-12-03 13:14:31,417 INFO hbase.HMaster - XX.XX.XX.102:60020 lease expired
> {code}
> ... and we go on to process shutdown.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
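For anyone reading along: the FATAL lines above ("unable to report to master for N milliseconds - aborting server") come down to a simple elapsed-time check — the regionserver tracks when it last successfully reported in, and aborts once that age exceeds the lease period the master holds for it. A minimal, hypothetical sketch of that decision (class and method names are mine for illustration, not the actual HRegionServer code; times are injected so it is deterministic):

{code}
/**
 * Hypothetical watchdog sketch: abort when time since the last
 * successful report to the master exceeds the lease period.
 */
public class ReportWatchdog {
    private final long leasePeriodMillis;
    private long lastReportMillis;

    public ReportWatchdog(long leasePeriodMillis, long nowMillis) {
        this.leasePeriodMillis = leasePeriodMillis;
        this.lastReportMillis = nowMillis;
    }

    /** Record a successful periodic report to the master. */
    public void reportSucceeded(long nowMillis) {
        lastReportMillis = nowMillis;
    }

    /** True once the server has been out of contact longer than the lease. */
    public boolean shouldAbort(long nowMillis) {
        return nowMillis - lastReportMillis > leasePeriodMillis;
    }

    public static void main(String[] args) {
        // 60s lease, as in this report.
        ReportWatchdog w = new ReportWatchdog(60_000, 0);
        w.reportSucceeded(5_000);
        System.out.println(w.shouldAbort(30_000));  // 25,000 ms since last report
        System.out.println(w.shouldAbort(72_467));  // 67,467 ms, as in the log above
    }
}
{code}

Note the asymmetry this implies: the master expires the lease on its own clock, so by the time the HRS notices it is late (the 67,467 ms in the second log snippet), the master may already have reassigned its regions.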