Awesome, and kudos on actually monitoring your environment! These issues are much easier to track down when users have at least basic historical information about their environment, such as host-level stats for CPU, memory, and disk.

Check out this ZK page for other troubleshooting details that can help with issues like the ones reported here:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

Regards and good luck!

Patrick

Joost Ouwerkerk wrote:
Ryan, you made me rethink my original hypothesis.  After looking at the ganglia
history, it's clear that the nodes in question were starved of memory,
swapping like crazy.  The expired scanner lease, the region shutting down,
and as you noted, the Zookeeper session expiry, were not a causal chain, but
all the result of the machine grinding to a halt from swapping.  The
MapReduce tasks were allocated too much memory, and an apparent memory leak
in the job we were running was causing the tasks to eat into the
RegionServer's share of the machine's memory.  I've reduced the memory
allocated to tasks in Hadoop's "mapred.child.java.opts" to ensure that
HADOOP_HEAPSIZE + the total maximum memory allocated to tasks +
HBASE_HEAPSIZE does not exceed the memory available on the machine.
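That budget is easy to sanity-check with a few lines of arithmetic. A minimal sketch with made-up numbers for a hypothetical 8 GB node; every value below is an assumption to be replaced with your own settings:

```shell
# All numbers are illustrative assumptions for an 8 GB node -- substitute yours.
MACHINE_MB=8192
HADOOP_HEAPSIZE_MB=1000   # HADOOP_HEAPSIZE in hadoop-env.sh
HBASE_HEAPSIZE_MB=2000    # HBASE_HEAPSIZE in hbase-env.sh
TASK_HEAP_MB=512          # -Xmx512m in mapred.child.java.opts
MAX_MAP_TASKS=4           # mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_TASKS=2        # mapred.tasktracker.reduce.tasks.maximum

# Worst case: every task slot full, each task at its maximum heap.
TASK_TOTAL_MB=$(( TASK_HEAP_MB * (MAX_MAP_TASKS + MAX_REDUCE_TASKS) ))
BUDGET_MB=$(( HADOOP_HEAPSIZE_MB + HBASE_HEAPSIZE_MB + TASK_TOTAL_MB ))

if [ "$BUDGET_MB" -gt "$MACHINE_MB" ]; then
  echo "over budget: ${BUDGET_MB} MB > ${MACHINE_MB} MB -- expect swapping"
else
  echo "ok: ${BUDGET_MB} MB <= ${MACHINE_MB} MB"
fi
```

Note that this deliberately ignores OS, page cache, and DataNode/TaskTracker daemon overhead, so in practice you want the budget to come in well under the machine total, not just equal to it.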

On Thu, Nov 26, 2009 at 3:59 PM, Ryan Rawson <[email protected]> wrote:

Hey,

Looking at the log below, it looks like you have run into a ZooKeeper
session expiry event.  If a RS loses its ZK session, it must give up,
since the regions it has open have probably been reassigned already,
and dual assignment Is Very Bad.  This is probably a GC-related pause;
this line in particular is indicative:

 2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 72462ms, ten times longer than scheduled: 3000
In the log somewhere there should be a "session expired" ZooKeeper
message.  I'm not seeing it in the snippets you posted.

There are several potential problems here:
- You need to be using the CMS GC. This is very important.
- If you are hitting your cluster hard with a job and maxing out the
IOW, you might be starving the ZK quorum of the IO bandwidth it needs
to do operations.
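To expand on the CMS point: on HBase setups of this vintage, CMS is typically enabled through HBASE_OPTS in conf/hbase-env.sh. A sketch only; the occupancy fraction is an illustrative value, not a recommendation for any particular hardware:

```shell
# conf/hbase-env.sh -- illustrative GC settings; tune for your own nodes.
# UseConcMarkSweepGC enables the concurrent collector, which avoids the
# long stop-the-world pauses that can expire a ZK session.
export HBASE_OPTS="-XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

A 72-second pause like the one in the Sleeper warning above is far longer than the default ZK session timeout, which is why the GC choice matters so much here.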

Can you tell us more about how you are running things?  What JVM, GC
config, node count, node spec, etc.

-ryan

On Thu, Nov 26, 2009 at 8:30 AM, Joost Ouwerkerk <[email protected]>
wrote:
I'm experiencing a scenario whereby long-running MapReduce tasks appear
to be taking RegionServers down, and I'm wondering:
 1) is anyone else experiencing this?
 2) is this expected behaviour, and if so,
 3) what can I do about it?

Scenario looks something like this (although I may be misinterpreting):

1. TableMapper task has not completed after about one minute

2. RegionServer reports:
 2009-11-26 01:53:09,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
637450415407903410 lease expired
 2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-2878823706426617596 lease expired
 2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4783927291358135864 lease expired
 2009-11-26 01:53:09,902 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
5282836853577726911 lease expired

Followed By:
 2009-11-26 01:53:09,882 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x252df4a91a0005 to sun.nio.ch.selectionkeyi...@49e808ca
   java.io.IOException: TIMED OUT
       at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
 2009-11-26 01:53:09,867 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 72462ms, ten times longer than scheduled: 3000
 2009-11-26 01:53:09,918 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
   org.apache.hadoop.hbase.UnknownScannerException: Name:
5282836853577726911
       at

org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1925)
3. Task Fails. (normal behaviour)

4. RegionServer starts closing its regions:
 2009-11-26 01:53:10,035 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
 2009-11-26 01:53:10,104 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
MSG_CALL_SERVER_STARTUP
 2009-11-26 01:53:10,187 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:11,653 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:14,184 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:18,567 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:19,147 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:21,265 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...
 2009-11-26 01:53:45,112 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed crawled_pages, ...

5. RegionServer shuts down:
 2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
 2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 1 on 60020: exiting
 2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 0 on 60020: exiting
 2009-11-26 01:53:12,425 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 2 on 60020: exiting
 2009-11-26 01:53:12,443 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
 2009-11-26 01:53:12,526 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
 2009-11-26 01:53:13,169 INFO
org.apache.hadoop.hbase.regionserver.LogFlusher:
regionserver/192.168.1.111:60020.logFlusher exiting
 2009-11-26 01:53:13,170 INFO
org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
 2009-11-26 01:53:13,170 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/192.168.1.111:60020.cacheFlusher exiting
 2009-11-26 01:53:13,170 INFO

org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver/192.168.1.111:60020.majorCompactionChecker exiting
 2009-11-26 01:53:13,169 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/192.168.1.111:60020.compactor exiting
 2009-11-26 01:53:13,449 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
 2009-11-26 01:53:13,450 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
192.168.1.111:60020
 2009-11-26 01:53:19,867 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
 2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ZooKeeper: Closing
session: 0x252df4a91a0005
 2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn: Closing
ClientCnxn for session: 0x252df4a91a0005
 2009-11-26 01:53:19,867 INFO org.apache.zookeeper.ClientCnxn:
Disconnecting ClientCnxn for session: 0x252df4a91a0005
 2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ClientCnxn:
EventThread
shut down
 2009-11-26 01:53:19,868 INFO org.apache.zookeeper.ZooKeeper: Session:
0x252df4a91a0005 closed
 2009-11-26 01:53:19,902 INFO org.apache.hadoop.hbase.Leases:
regionserver/192.168.1.111:60020.leaseChecker closing leases
 2009-11-26 01:53:19,903 INFO org.apache.hadoop.hbase.Leases:
regionserver/192.168.1.111:60020.leaseChecker closed leases
 2009-11-26 01:53:19,969 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/
192.168.1.111:60020 exiting
 2009-11-26 01:53:19,992 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
thread.
 2009-11-26 01:53:19,992 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread
complete
Joost

