Our heap size is set to 2GB, so I think my dev issue was because I was running things off of a few VMs. Even though the compaction runs in another thread, the region server would still fail to respond during a major compaction.
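
For anyone hitting the same "marked as dead during compaction" behavior, one knob that matters here is the ZooKeeper session timeout: if a region server is too busy or paused (GC, heavy compaction) to heartbeat before its session expires, the master declares it dead. A minimal hbase-site.xml sketch, assuming HBase 0.90.x where the property is zookeeper.session.timeout (milliseconds); the value below is only illustrative, not a tested recommendation:

<!-- conf/hbase-site.xml: how long ZooKeeper waits before expiring a region
     server's session. Value is in milliseconds; 180000 (3 minutes) is only
     an illustrative figure, not a tested recommendation. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value>
</property>

Whether a larger value actually takes effect also depends on the ZooKeeper server side, since its own maxSessionTimeout caps what a client can request.
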
-Ben

On Thu, Apr 14, 2011 at 11:26 AM, Jean-Daniel Cryans <[email protected]> wrote:

> Ben, the compaction is done in a background thread, it doesn't block
> anything. Now if you had a heap close to 2GB, you could easily run
> into issues.
>
> J-D
>
> On Thu, Apr 14, 2011 at 10:23 AM, Ben Aldrich <[email protected]> wrote:
> > Just to chime in here, the other thing we changed was our max_file_size
> > is now set to 2gb instead of 512mb. This could be causing long
> > compaction times. If a compaction takes too long it won't respond and
> > can be marked as dead. I have had this happen on my dev cluster a few
> > times.
> >
> > -Ben
> >
> > On Thu, Apr 14, 2011 at 11:20 AM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> This is probably a red herring; for example, if the region server had
> >> a big GC pause then the master could have already split the log and
> >> the region server wouldn't be able to close it (that's our version of
> >> IO fencing). So from that exception, look back in the log and see if
> >> there's anything like:
> >>
> >> INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
> >> not heard from server in some_big_number ms
> >>
> >> J-D
> >>
> >> On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
> >> <[email protected]> wrote:
> >> >
> >> > Thanks for the response, Stack. Yes, we tried increasing
> >> > dfs.datanode.handler.count to 8. At this point I would say it didn't
> >> > seem to resolve the issue we are seeing, but it also doesn't seem to
> >> > be hurting anything, so for right now we're going to leave it at 8
> >> > while we continue to debug.
> >> >
> >> > In regard to the original error I posted ( Block 'x' is not valid ),
> >> > we have chased that down thanks to your suggestion of looking at the
> >> > logs for the history of the block. It _looks_ like our 'is not valid'
> >> > block errors are unrelated and due to chmod-ing or deleting mapreduce
> >> > output directories directly after a run. We are still isolating that,
> >> > but it looks like it's not HBase related, so I'll move that to
> >> > another list. Thank you very much for your debugging suggestions.
> >> >
> >> > The one issue we are still seeing is that we will occasionally have a
> >> > regionserver die with the following exception. I need to chase that
> >> > down a little more, but it seems similar to a post from 2/13/2011 (
> >> > http://www.mail-archive.com/[email protected]/msg05550.html ) that
> >> > I'm not sure was ever resolved. If anyone has any insight on how to
> >> > debug the following error a little more, I would appreciate any
> >> > thoughts you might have.
> >> >
> >> > 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921 :
> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> >> > java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed because recovery from primary datanode 10.18.0.16:50010 failed 6 times. Pipeline was 10.18.0.16:50010. Aborting...
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
> >> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
> >> >
> >> > Other than the above exception causing a region server to die
> >> > occasionally, everything seems to be working well.
> >> >
> >> > Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop
> >> > 0.20.2+923.21 and hbase 0.90.1+15.18 ) and still see the above
> >> > exception. We do have ulimit set ( memory unlimited and files 32k )
> >> > for the user running hbase.
> >> >
> >> > Thanks again for your help
> >> >
> >> > Andy
> >> >
> >> > -----Original Message-----
> >> > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> >> > Sent: Sunday, April 10, 2011 1:16 PM
> >> > To: [email protected]
> >> > Cc: Andy Sautins
> >> > Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
> >> >
> >> > Did you try upping it, Andy? Andrew Purtell's recommendation, though
> >> > old, would have come of experience. The Intel article reads like
> >> > sales, but there is probably merit to its suggestion. The Cloudera
> >> > article is more unsure about the effect of upping handlers, though it
> >> > allows "...could be set a bit higher."
> >> >
> >> > I just looked at our prod frontend and it's set to 3 still. I don't
> >> > see your exceptions in our DN log.
> >> >
> >> > What version of hadoop? You say hbase 0.91. You mean 0.90.1?
> >> >
> >> > ulimit and nproc are set sufficiently high for the hadoop/hbase user?
> >> >
> >> > If you grep 163126943925471435_28809750 in the namenode log, do you
> >> > see a delete occur before a later open?
> >> >
> >> > St.Ack
> >> >
> >> > On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins
> >> > <[email protected]> wrote:
> >> >>
> >> >> I ran across a mailing list posting from 1/4/2009 that seemed to
> >> >> indicate increasing dfs.datanode.handler.count could help improve
> >> >> DFS stability (
> >> >> http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%[email protected]%3E ).
> >> >> The posting seems to indicate the wiki was updated, but I don't see
> >> >> anything in the wiki about increasing dfs.datanode.handler.count. I
> >> >> have seen a few other notes that show examples with a raised
> >> >> dfs.datanode.handler.count, including an Intel article (
> >> >> http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/ )
> >> >> and the Pro Hadoop book, but other than that the only other mention
> >> >> I see, from Cloudera, seems lukewarm on increasing
> >> >> dfs.datanode.handler.count (
> >> >> http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ ).
> >> >>
> >> >> Given the post is from 2009, I thought I'd ask if anyone has had any
> >> >> success improving the stability of HBase/DFS when increasing
> >> >> dfs.datanode.handler.count. The specific error we are seeing
> >> >> somewhat frequently ( a few hundred times per day ) in the datanode
> >> >> logs is as follows:
> >> >>
> >> >> 2011-04-09 00:12:48,035 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.18.0.33:50010, storageID=DS-1501576934-10.18.0.33-50010-1296248656454, infoPort=50075, ipcPort=50020):DataXceiver
> >> >> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
> >> >>
> >> >> The above seems to correspond to ClosedChannelExceptions in the
> >> >> hbase regionserver logs, as well as some warnings about long writes
> >> >> to the hlog ( some in the 50+ second range ).
> >> >>
> >> >> The biggest end-user facing issue we are seeing is that Task
> >> >> Trackers keep getting blacklisted. It's quite possible our problem
> >> >> is unrelated to anything HBase, but I thought it was worth asking
> >> >> given what we've been seeing.
> >> >>
> >> >> We are currently running 0.91 on an 18 node cluster with ~3k total
> >> >> regions and each region server is running with 2G of memory.
> >> >>
> >> >> Any insight would be appreciated.
> >> >>
> >> >> Thanks
> >> >>
> >> >> Andy
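
For reference, the datanode setting discussed throughout this thread lives in hdfs-site.xml. A minimal sketch; the handler count of 8 simply mirrors what Andy reports trying above (the stock default is 3), and is illustrative rather than a tuned recommendation:

<!-- conf/hdfs-site.xml on each datanode: number of server threads handling
     datanode IPC requests. The value 8 mirrors what was tried in this thread
     (default is 3); it is illustrative, not a tuned recommendation. -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>

The change only takes effect after the datanodes are restarted.
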
