Hey,
Looking at the log below, it looks like you have run into a Zookeeper
session expiry event. If a RS loses it's ZK lease, it must give up,
since the regions it has open have probably been reassigned already,
and dual assignment Is Very Bad. This is probably a GC related pause,
this line in particular is indicative:
2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 72462ms, ten times longer than scheduled: 3000
In the log somewhere there should be a "session expired" zookeeper
message. I'm not seeing it in the snippets you post.
There are several potential problems here:
- You need to be using the CMS GC. This is very important.
- If you are hitting your cluster hard with a job and maxing out the
IOW, you might be starving the ZK quorum of the IO bandwidth it needs
to do operations.
Can you tell us more about how you are running things? What JVM, GC
config, node count, node spec, etc.
-ryan
On Thu, Nov 26, 2009 at 8:30 AM, Joost Ouwerkerk <[email protected]>
wrote:
I'm experiencing a scenario whereby long-running MapReduce tasks appear
to
be taking RegionServers down, wondering if
1) anyone else is experiencing this
2) this is expected behaviour, and if so,
3) what I can do about it.
Scenario looks something like this (although I may be misinterpreting):
1.TableMapper task has not completed after about one minute
2. RegionServer reports:
2009-11-26 01:53:09,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
637450415407903410 lease expired
2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-2878823706426617596 lease expired
2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4783927291358135864 lease expired
2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
5282836853577726911 lease expired
Followed By:
2009-11-26 01:53:09,882 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x252df4a91a0005 to sun.nio.ch.selectionkeyi...@49e808ca
java.io.IOException: TIMED OUT
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 72462ms, ten times longer than scheduled: 3000
2009-11-26 01:53:09,918 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.UnknownScannerException: Name:
5282836853577726911
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1925)
3. Task Fails. (normal behaviour)
4. RegionServer starts closing its regions:
2009-11-26 01:53:10,035 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2009-11-26 01:53:10,104 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
MSG_CALL_SERVER_STARTUP
2009-11-26 01:53:10,187 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:11,653 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:14,184 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:18,567 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:19,147 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:21,265 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
2009-11-26 01:53:45,112 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
5. RegionServer shuts down:
2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 1 on 60020: exiting
2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 0 on 60020: exiting
2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 2 on 60020: exiting
2009-11-26 01:53:12,443 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-11-26 01:53:12,526 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
2009-11-26 01:53:13,169 INFO
org.apache.hadoop.hbase.regionserver.LogFlusher:
regionserver/192.168.1.111:60020.logFlusher exiting
2009-11-26 01:53:13,170 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2009-11-26 01:53:13,170 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/192.168.1.111:60020.cacheFlusher exiting
2009-11-26 01:53:13,170 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver/192.168.1.111:60020.majorCompactionChecker exiting
2009-11-26 01:53:13,169 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/192.168.1.111:60020.compactor exiting
2009-11-26 01:53:13,449 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-11-26 01:53:13,450 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
192.168.1.111:60020
2009-11-26 01:53:19,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ZooKeeper: Closing
session: 0x252df4a91a0005
2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn: Closing
ClientCnxn for session: 0x252df4a91a0005
2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn:
Disconnecting ClientCnxn for session: 0x252df4a91a0005
2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ClientCnxn:
EventThread
shut down
2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ZooKeeper: Session:
0x252df4a91a0005 closed
2009-11-26 01:53:19,902 INFO org.apache.hadoop.hbase.Leases:
regionserver/192.168.1.111:60020.leaseChecker closing leases
2009-11-26 01:53:19,903 INFO org.apache.hadoop.hbase.Leases:
regionserver/192.168.1.111:60020.leaseChecker closed leases
2009-11-26 01:53:19,969 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/
192.168.1.111:60020 exiting
2009-11-26 01:53:19,992 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
thread.
2009-11-26 01:53:19,992 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread
complete
Joost