Thanks Stack and Harsh, I'll try both suggestions and update the list with the results.
-eran

On Wed, Mar 28, 2012 at 17:21, Harsh J <ha...@cloudera.com> wrote:

> Eran,
>
> For 0.90.7 SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated"
> to > 0 (default). This will help RS survive transient HLog sync
> failures (with local DN) by retrying a few times before the RS decides
> to shut itself down.
>
> Also worth investigating if you had too much IO load/etc. on the box
> that lead to the DN throwing up an error during sync().
>
> P.s. The fix from https://issues.apache.org/jira/browse/HBASE-4222
> will also be in CDH3u4.
>
> On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner <e...@gigya.com> wrote:
> > Hi Jimmy,
> > HBase is built from latest sources of 0.90 branch (0.90.7-SNAPSHOT), I had
> > the same problem with 0.90.4
> > Hadoop 0.20.2 from Cloudera CDH3u1
> >
> > This failure happens during large M/R jobs, I have 10 servers and usually
> > no more than 1 would fail like this, sometimes none.
> > One thing worth mentioning is that the table it is trying to write to has
> > over 5000 regions.
> >
> > -eran
> >
> > On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang <jxi...@cloudera.com> wrote:
> >
> >> Which version of HDFS and HBase are you using?
> >>
> >> When the problem happens, can you access the HDFS, for example, from
> >> hadoop dfs?
> >>
> >> Thanks,
> >> Jimmy
> >>
> >> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <e...@gigya.com> wrote:
> >> > Hi,
> >> >
> >> > We have region server sporadically stopping under load due supposedly to
> >> > errors writing to HDFS. Things like:
> >> >
> >> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> >> >
> >> > It's happening with a different region server and data node every time, so
> >> > it's not a problem with one specific server and there doesn't seem to be
> >> > anything really wrong with either of them. I've already increased the file
> >> > descriptor limit, datanode xceivers and data node handler count. Any idea
> >> > what can be causing these errors?
> >> >
> >> > A more complete log is here: http://pastebin.com/wC90xU2x
> >> >
> >> > Thanks.
> >> >
> >> > -eran
>
> --
> Harsh J
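
For anyone following the thread, the setting Harsh mentions could be applied roughly as below. This is only a sketch: it assumes the property goes in hbase-site.xml (inside the <configuration> element) on each region server, and the value 2 is an arbitrary illustrative choice, since the thread only asks for something above 0.

```xml
<!-- hbase-site.xml on each region server (sketch, not a tested recommendation) -->
<property>
  <!-- Property name as quoted in the thread (from HBASE-4222) -->
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <!-- Assumption: any value above 0 enables the retries described above;
       2 is picked here purely for illustration -->
  <value>2</value>
</property>
```

Whether a higher value is appropriate probably depends on how transient the datanode errors are, so the IO load on the affected boxes is still worth checking, as suggested above.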