[jira] Updated: (HADOOP-2343) [hbase] Stuck regionserver?
[ https://issues.apache.org/jira/browse/HADOOP-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-2343:
--
    Priority: Trivial (was: Minor)

[hbase] Stuck regionserver?
---
    Key: HADOOP-2343
    URL: https://issues.apache.org/jira/browse/HADOOP-2343
    Project: Hadoop
    Issue Type: Bug
    Components: contrib/hbase
    Reporter: stack
    Assignee: stack
    Priority: Trivial

Looking in logs, a regionserver went down because it could not contact the master after 60 seconds. Watching logging, the HRS is repeatedly checking all 150 loaded regions over and over again with a pause of about 5 seconds between runs... then there is a suspicious 60+ second gap with no logging, as though the regionserver had hung up on something:
{code}
2007-12-03 13:14:54,178 DEBUG hbase.HRegionServer - flushing region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635
2007-12-03 13:14:54,178 DEBUG hbase.HRegion - Not flushing cache for region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635: snapshotMemcaches() determined that there was nothing to do
2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region postlog,img247/230/seanpaul4li.jpg,1196615889965
2007-12-03 13:14:54,205 DEBUG hbase.HRegion - Not flushing cache for region postlog,img247/230/seanpaul4li.jpg,1196615889965: snapshotMemcaches() determined that there was nothing to do
2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to master for 67467 milliseconds - aborting server
2007-12-03 13:16:04,455 INFO hbase.Leases - regionserver/0:0:0:0:0:0:0:0:60020 closing leases
2007-12-03 13:16:04,455 INFO hbase.Leases$LeaseMonitor - regionserver/0:0:0:0:0:0:0:0:60020.leaseChecker exiting
{code}
Master seems to be running fine scanning its ~700 regions. Then you see this in the log, before the HRS shuts itself down.
{code}
2007-12-03 13:14:31,416 INFO hbase.Leases - HMaster.leaseChecker lease expired 153260899/153260899
2007-12-03 13:14:31,417 INFO hbase.HMaster - XX.XX.XX.102:60020 lease expired
{code}
... and we go on to process shutdown.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
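The silent stretch described in the issue can be located mechanically rather than by eye. The following is a minimal sketch, not part of HBase: it assumes every log line starts with a timestamp in the layout shown in the quoted logs, and the names `parse_ts`, `find_gaps`, and `SAMPLE` are hypothetical helpers introduced here for illustration. The 60-second threshold mirrors the timeout mentioned in the report. On the lines quoted above, the stretch from 13:14:54,205 to 13:16:04,305 is the one it flags.

```python
from datetime import datetime

# Timestamp layout of the log lines quoted in the issue,
# e.g. "2007-12-03 13:14:54,178" (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold_seconds=60.0):
    """Yield (previous_line, next_line, gap_seconds) for each silent
    stretch longer than threshold_seconds between consecutive log lines."""
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp (e.g. stack traces)
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_seconds:
                yield prev_line, line, gap
        prev_ts, prev_line = ts, line


# Three lines taken from the regionserver log quoted in the issue.
SAMPLE = [
    "2007-12-03 13:14:54,178 DEBUG hbase.HRegionServer - flushing region postlog,img151/60/plakatlepperduzy1hh7.jpg,1196614355635",
    "2007-12-03 13:14:54,205 DEBUG hbase.HRegionServer - flushing region postlog,img247/230/seanpaul4li.jpg,1196615889965",
    "2007-12-03 13:16:04,305 FATAL hbase.HRegionServer - unable to report to master for 67467 milliseconds - aborting server",
]

if __name__ == "__main__":
    for before, after, gap in find_gaps(SAMPLE):
        print(f"{gap:.1f}s of silence before: {after}")
```

Running the same scan over the master log for the same time window would show whether the master was also quiet, which bears on whether the stall was on the regionserver or the master side.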
[jira] Updated: (HADOOP-2343) [hbase] Stuck regionserver?
[ https://issues.apache.org/jira/browse/HADOOP-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-2343:
--
    Priority: Minor (was: Major)

I think this was probably caused by thread starvation in the master or by not having enough server threads allocated in the master. The latter is a configuration parameter. If the former, it was probably caused by the bug in HADOOP-2338. Changing priority to minor.
[jira] Updated: (HADOOP-2343) [hbase] Stuck regionserver?
[ https://issues.apache.org/jira/browse/HADOOP-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated HADOOP-2343:
--
    Priority: Major (was: Minor)

Affects cluster stability, but cluster recovers on restart, so changing to major.