We've got dfs.replication = 3 in hdfs-site.xml. Doing a grep for "FATAL" and the surrounding 50 lines yields this:
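(For reference, that was just "grep -C 50 FATAL" run against each node's regionserver log; the exact log path depends on where your install writes its logs.)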
Regionserver log: http://pastebin.com/3cYYNhct

HMaster and DataNode logs seem pretty boring, no errors. There are some sections with lots of scheduling/deleting-blocks messages...

Restarted the HBase nodes and ran the MR job again (it's just reading CSV into a table). It seems to be running just fine.
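For context, the job is essentially the following kind of map-only load. This is only a rough sketch, not our actual code; the table name ("csv_table"), column family ("d"), and CSV layout below are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CsvToHBase {

  // Parses one CSV line per call and emits a Put keyed on the first column.
  static class CsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length < 2) return; // skip malformed rows
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      // "d:value" is a placeholder family/qualifier, not the real schema.
      put.add(Bytes.toBytes("d"), Bytes.toBytes("value"),
          Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "csv-to-hbase");
    job.setJarByClass(CsvToHBase.class);
    job.setMapperClass(CsvMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Null reducer: just wires up TableOutputFormat for the named table.
    TableMapReduceUtil.initTableReducerJob("csv_table", null, job);
    job.setNumReduceTasks(0); // map-only load
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

(All of those writes funnel through the regionserver WAL, i.e. the HLog/DFSClient sync path that shows up in the stack trace below.)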
On Sun, Feb 13, 2011 at 11:08 PM, Jonathan Gray <jg...@fb.com> wrote:
> The DFS errors are after the server aborts. What is in the log before the
> server abort? Doesn't seem to show any reason here which is unusual.
>
> Anything in the master? Did it time out this RS? You're running with
> replication = 1?
>
>> -----Original Message-----
>> From: Bradford Stephens [mailto:bradfordsteph...@gmail.com]
>> Sent: Sunday, February 13, 2011 10:59 PM
>> To: user@hbase.apache.org
>> Subject: "Error recovery for block... failed because recovery from primary
>> datanode failed 6 times"
>>
>> Hey guys,
>>
>> I'm occasionally getting regionservers going down (running a late RC of .89
>> that Ryan built). 5x c2.xlarge nodes (8gb/6 cores?) on EC2 with EBS drives.
>>
>> Here's the error message from the RS log. Hadoop fsck shows it's fine.
>>
>> Any ideas?
>>
>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed mobile4-2011021,20110122:37b16319-58e8-4809-bca6-83d7598a41dd:E84F9612-CE1A-4FE1-AAE9-2A7AF8C9B2F1:21519,1297657239532.d15ce98030138cad79e248e0845b70ee.
>> 2011-02-14 01:51:51,715 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: ip-10-243-106-63.ec2.internal,60020,1297656774012
>> 2011-02-14 01:51:51,711 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: regionserver60020.majorCompactionChecker exiting
>> 2011-02-14 01:51:51,856 INFO org.apache.zookeeper.ZooKeeper: Session: 0x12e225ef5640002 closed
>> 2011-02-14 01:51:51,856 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <ip-10-204-213-153.ec2.internal:/hbase,ip-10-243-106-63.ec2.internal,60020,1297656773719>Closed connection with ZooKeeper; /hbase/root-region-server
>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
>> 2011-02-14 01:51:58,706 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closing leases
>> 2011-02-14 01:52:00,031 INFO org.apache.hadoop.hbase.Leases: regionserver60020.leaseChecker closed leases
>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>> 2011-02-14 01:52:00,033 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>> 2011-02-14 01:52:00,036 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase-entest/.logs/ip-10-243-106-63.ec2.internal,60020,1297656774012/10.243.106.63%3A60020.1297660376363 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed because recovery from primary datanode 10.243.106.63:50010 failed 6 times. Pipeline was 10.243.106.63:50010. Aborting...
>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_208685344091455182_10263 failed because recovery from primary datanode 10.243.106.63:50010 failed 6 times. Pipeline was 10.243.106.63:50010. Aborting...
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:123)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:906)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.completeCacheFlush(HLog.java:1078)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:943)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834)
>>         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786)
>>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250)
>>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224)
>>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)
>> 2011-02-14 01:52:00,076 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>> 2011-02-14 01:52:00,139 WARN org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer connected to ZooKeeper, current state: Disconnected
>>
>> --
>> Bradford Stephens,
>> Founder, Drawn to Scale
>> drawntoscalehq.com
>> 727.697.7528
>>
>> http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution.
>> Process, store, query, search, and serve all your data.
>>
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and
>> Computer Science
>

--
Bradford Stephens,
Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528

http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution.
Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and
Computer Science