Thanks Stack and Harsh, I'll try both suggestions and update the list with
the results.

-eran



On Wed, Mar 28, 2012 at 17:21, Harsh J <ha...@cloudera.com> wrote:

> Eran,
>
> For 0.90.7 SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated"
> to a value greater than 0 (0 being the default). This will help the RS
> survive transient HLog sync failures (with the local DN) by retrying a few
> times before the RS decides to shut itself down.
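>
> As a rough sketch of the setting above: the property would normally go in
> hbase-site.xml on each region server. The value 2 below is only an
> illustrative assumption, not a tested recommendation, and a region server
> restart is likely needed for it to take effect.
>
> ```xml
> <!-- hbase-site.xml: tolerate a few HLog errors before the RS aborts.
>      The value 2 is an assumed example; per the advice above, any value
>      greater than 0 should allow some retries. -->
> <property>
>   <name>hbase.regionserver.logroll.errors.tolerated</name>
>   <value>2</value>
> </property>
> ```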
>
> Also worth investigating whether you had too much IO load, etc. on the box
> that led to the DN throwing up an error during sync().
>
> P.s. The fix from https://issues.apache.org/jira/browse/HBASE-4222
> will also be in CDH3u4.
>
> On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner <e...@gigya.com> wrote:
> > Hi Jimmy,
> > HBase is built from the latest sources of the 0.90 branch
> > (0.90.7-SNAPSHOT); I had the same problem with 0.90.4.
> > Hadoop is 0.20.2 from Cloudera CDH3u1.
> >
> > This failure happens during large M/R jobs. I have 10 servers, and usually
> > no more than one fails like this, sometimes none.
> > One thing worth mentioning is that the table it is trying to write to has
> > over 5000 regions.
> >
> > -eran
> >
> >
> >
> > On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang <jxi...@cloudera.com> wrote:
> >
> >> Which version of HDFS and HBase are you using?
> >>
> >> When the problem happens, can you access the HDFS, for example, from
> >> hadoop dfs?
> >>
> >> Thanks,
> >> Jimmy
> >>
> >> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <e...@gigya.com> wrote:
> >> > Hi,
> >> >
> >> > We have region servers sporadically stopping under load, supposedly due
> >> > to errors writing to HDFS. Things like:
> >> >
> >> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> >> >
> >> > It's happening with a different region server and data node every time,
> >> > so it's not a problem with one specific server and there doesn't seem to
> >> > be anything really wrong with either of them. I've already increased the
> >> > file descriptor limit, datanode xceivers and data node handler count. Any
> >> > idea what can be causing these errors?
> >> >
> >> >
> >> > A more complete log is here: http://pastebin.com/wC90xU2x
> >> >
> >> > Thanks.
> >> >
> >> > -eran
> >>
>
>
>
> --
> Harsh J
>
