Lars, your article provides very good information on this topic. Even 4096 xceivers may crash the cluster and cause serious, unpredictable problems.
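To put a rough number on the memory side alone (a back-of-the-envelope sketch, assuming the common 64-bit JVM default of roughly 1 MB of stack per thread; the actual -Xss value varies by JVM and platform):

    4096 xceiver threads x ~1 MB thread stack = ~4 GB of stack
    address space reservable per DataNode, before counting heap,
    buffers, or the disk contention those threads generate.

So a DataNode does not even need to hit the limit for the setting to hurt; merely approaching it can exhaust resources.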
IMO, the current documentation is obsolete and misleading, so my proposal is to update the HBase book accordingly.
--
Regards,
Laxman

> -----Original Message-----
> From: Lars George [mailto:[email protected]]
> Sent: Thursday, March 22, 2012 12:29 PM
> To: [email protected]; [email protected]
> Subject: Re: Max xceiver config
>
> Hi Laxman,
>
> Did you see (sorry for the plug)
> http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html - it
> might help determining the number.
>
> Lars
>
> On Mar 22, 2012, at 6:43 AM, Laxman wrote:
>
> > The HBase book recommends setting the xceiver count
> > [dfs.datanode.max.xcievers] to 4096:
> > http://hbase.apache.org/book.html#hadoop
> >
> > Why does the xceiver count need to be as high as 4096?
> >
> > This means each DataNode in the cluster allows a maximum of
> > - 4096 threads, each occupying some memory
> > - 4096 threads reading from/writing to the disk(s) simultaneously
> >
> > This actually makes the system more vulnerable (a kind of DoS attack)
> > through over-utilization of system resources.
> >
> > Also, this recommendation was based on an issue reported on Hadoop 0.18.
> > IMO, we should not ship such a high value as the recommendation/default;
> > it should be tuned according to capacity requirements.
> >
> > Related issues
> > ==============
> > HDFS-162
> > - Reported on 0.18.
> > - Raising the xceiver count to a high value caused other problems.
> > - Resolution: "Cannot Reproduce"
> >
> > HDFS-1861
> > - Modified the default value to 4096.
> > - Source:
> > http://ccgtech.blogspot.in/2010/02/hadoop-hdfs-deceived-by-xciever.html
> > which again refers to HDFS-162 (reported on 0.18).
> >
> > Case study
> > ==========
> > http://lucene.472066.n3.nabble.com/Blocks-are-getting-corrupted-under-very-high-load-tc3527403.html
> > In one of our production environments, this value was set to 4096 and
> > disk waits became so high that some processes stopped responding. The
> > OS was also configured to reboot (kernel panic reboot) when a process
> > is unresponsive for a specific amount of time.
> >
> > These two configurations together resulted in corrupted data.
> > --
> > Regards,
> > Laxman
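P.S. For anyone who wants to tune this down, the knob lives in hdfs-site.xml on each DataNode. A minimal sketch (2048 below is purely an illustrative value to be sized against your cluster's capacity, not a recommendation; note the historically misspelled property name, which newer Hadoop releases rename to dfs.datanode.max.transfer.threads):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2048</value>
      <description>Upper bound on concurrent DataXceiver threads
      (block readers/writers) per DataNode.</description>
    </property>

DataNodes need a restart to pick up the new value.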
