Just to chime in here, the other thing we changed was our max_file_size, which is now set to 2 GB instead of 512 MB. This could be causing long compaction times. If a compaction takes too long, the region server won't respond and can be marked as dead. I have had this happen on my dev cluster a few times.
-Ben

On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:

> This is probably a red herring. For example, if the region server had a
> big GC pause then the master could have already split the log and the
> region server wouldn't be able to close it (that's our version of IO
> fencing). So from that exception look back in the log and see if
> there's anything like:
>
> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in some_big_number ms
>
> J-D
>
> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> <[email protected]> wrote:
> >
> > Thanks for the response stack. Yes we tried increasing
> > dfs.datanode.handler.count to 8. At this point I would say it didn't
> > seem to resolve the issue we are seeing, but it also doesn't seem to
> > be hurting anything, so for right now we're going to leave it at 8
> > while we continue to debug.
> >
> > In regard to the original error I posted ( Block 'x' is not valid ) we
> > have chased that down thanks to your suggestion of looking at the logs
> > for the history of the block. It _looks_ like our 'is not valid' block
> > errors are unrelated and due to chmod or deleting mapreduce output
> > directories directly after a run. We are still isolating that but it
> > looks like it's not HBase related so I'll move that to another list.
> > Thank you very much for your debugging suggestions.
> >
> > The one issue we are still seeing is that we will occasionally have a
> > regionserver die with the following exception. I need to chase that
> > down a little more but it seems similar to a post from 2/13/2011
> > ( http://www.mail-archive.com/[email protected]/msg05550.html ) that I'm
> > not sure was ever resolved. If anyone has any insight on how to debug
> > the following error a little more I would appreciate any thoughts you
> > might have.
> >
> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 : java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >
> > Other than the above exception causing a region server to die
> > occasionally everything seems to be working well.
> >
> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> > 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
> > exception. We do have ulimit set ( memory unlimited and files 32k )
> > for the user running hbase.
> >
> > Thanks again for your help
> >
> > Andy
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > Sent: Sunday, April 10, 2011 1:16 PM
> > To: [email protected]
> > Cc: Andy Sautins
> > Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
> >
> > Did you try upping it Andy? Andrew Purtell's recommendation though old
> > would have come of experience. The Intel article reads like sales but
> > there is probably merit to its suggestion. The Cloudera article is
> > more unsure about the effect of upping handlers though it allows
> > "...could be set a bit higher."
> >
> > I just looked at our prod frontend and it's set to 3 still. I don't
> > see your exceptions in our DN log.
> >
> > What version of hadoop? You say hbase 0.91.
> > You mean 0.90.1?
> >
> > ulimit and nproc are set sufficiently high for hadoop/hbase user?
> >
> > If you grep 163126943925471435_28809750 in namenode log, do you see a
> > delete occur before a later open?
> >
> > St.Ack
> >
> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <[email protected]> wrote:
> >>
> >> I ran across a mailing list posting from 1/4/2009 that seemed to
> >> indicate increasing dfs.datanode.handler.count could help improve DFS
> >> stability
> >> ( http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E ).
> >> The posting seems to indicate the wiki was updated, but I don't see
> >> anything in the wiki about increasing dfs.datanode.handler.count. I
> >> have seen a few other notes that seem to show examples that have
> >> raised dfs.datanode.handler.count including one from an Intel article
> >> ( http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ )
> >> and the Pro Hadoop book, but other than that the only other mention I
> >> see is from Cloudera, which seems lukewarm on increasing
> >> dfs.datanode.handler.count
> >> ( http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
> >>
> >> Given the post is from 2009 I thought I'd ask if anyone has had any
> >> success improving stability of HBase/DFS when increasing
> >> dfs.datanode.handler.count. The specific error we are seeing somewhat
> >> frequently ( few hundred times per day ) in the datanode logs is as
> >> follows:
> >>
> >> 2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.0.33:50010, storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075, ipcPort=50020):DataXceiver
> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
> >>
> >> The above seems to correspond to ClosedChannelExceptions in the hbase
> >> regionserver logs as well as some warnings about long write to hlog
> >> ( some in the 50+ seconds ).
> >>
> >> The biggest end-user facing issue we are seeing is that Task Trackers
> >> keep getting blacklisted. It's quite possible our problem is
> >> unrelated to anything HBase, but I thought it was worth asking given
> >> what we've been seeing.
> >>
> >> We are currently running 0.91 on an 18 node cluster with ~3k total
> >> regions and each region server is running with 2G of memory.
> >>
> >> Any insight would be appreciated.
> >>
> >> Thanks
> >>
> >> Andy
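
For anyone following J-D's suggestion above of looking back through the region server log for ZooKeeper session timeouts, here is a minimal Python sketch of that search. It is only an illustration: the function name, the `min_ms` threshold, and the log path in the usage note are made up for this example, and the pattern assumes the message looks like the line J-D quoted, with a number in place of some_big_number.

```python
import re

# Matches the ZooKeeper client message quoted in the thread. The exact
# wording and spacing before "ms" are assumptions based on that quote.
TIMEOUT_RE = re.compile(
    r"org\.apache\.zookeeper\.ClientCnxn: Client session timed out, "
    r"have not heard from server in (\d+)\s*ms"
)


def find_session_timeouts(lines, min_ms=0):
    """Return (line_number, silence_ms, text) for each matching log line.

    lines  -- any iterable of log lines, e.g. an open file
    min_ms -- hypothetical filter: only report silences at least this long
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = TIMEOUT_RE.search(line)
        if match and int(match.group(1)) >= min_ms:
            hits.append((lineno, int(match.group(1)), line.rstrip("\n")))
    return hits
```

Usage would be something like `with open("regionserver.log") as f: hits = find_session_timeouts(f, min_ms=30000)`, where the file name is a placeholder for wherever your region server writes its log. A hit shortly before the DFSClient "Error Recovery for block" exception would point at a long pause rather than at HDFS itself.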
