After 3 days of running with the configuration changes recommended by J-D,
the cluster seems stable now.
For the benefit of others, there were two issues identified:
First, HBASE_HEAP was set too high. It turns out that each Hadoop daemon
takes at least 1GB at startup even if it's doing nothing. Since we have a
data node, a task tracker and a thrift server running on each machine, those
take up 3GB of RAM that must be accounted for when allocating memory for the
region server.
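To make the arithmetic concrete, here is a small sketch of the heap budget. The 16GB total and the 2GB OS reserve are assumptions for illustration; only the ~1GB per daemon (3GB total for the data node, task tracker and thrift server) comes from our setup:

```shell
# Hypothetical sizing for a 16 GB node (TOTAL_RAM_GB and OS_RESERVE_GB
# are illustrative assumptions, not our actual numbers).
TOTAL_RAM_GB=16
OTHER_DAEMONS_GB=3   # data node + task tracker + thrift server, ~1 GB each
OS_RESERVE_GB=2      # headroom for the OS and page cache
HBASE_HEAP_GB=$((TOTAL_RAM_GB - OTHER_DAEMONS_GB - OS_RESERVE_GB))
# HBASE_HEAPSIZE in hbase-env.sh is expressed in MB:
echo "export HBASE_HEAPSIZE=$((HBASE_HEAP_GB * 1024))"
```

On a 16GB node this leaves 11GB for the region server, printed as the line to drop into hbase-env.sh.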
Second, we had "-XX:+CMSIncrementalMode" configured, which is apparently a
poor fit for multi-core systems (incremental CMS is intended for machines
with only one or two CPUs).
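For reference, the change amounts to dropping that one flag from HBASE_OPTS in hbase-env.sh. The before/after lines below are illustrative, not our exact settings:

```shell
# hbase-env.sh sketch (illustrative; only the removal of
# -XX:+CMSIncrementalMode reflects the actual fix described above)
# before: export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
```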

Thanks J-D for all the help.

-eran



On Mon, Apr 11, 2011 at 23:53, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Alright so I was able to get the logs from Eran, the HDFS errors are a
> red herring, what followed in the region server log that is really
> important is:
>
> 2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 144490ms for
> sessionid 0x12ee42283320050, closing socket connection and attempting
> reconnect
>
> Which is a ~2m24s GC pause. The HDFS errors come from the fact that the
> master split the logs _while_ the region server was sleeping.
>
> J-D
>
> On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans
> <jdcry...@apache.org> wrote:
> > So my understanding is that this log file was opened at 7:29 and then
> > something happened at 10:12:55 that triggered recovery on that block,
> > renaming it to blk_1213779416283711358_54249
> >
> > It seems that that process was started by the DFS Client at 10:12:55
> > but the RS log starts at 10:14. Would it be possible to see what was
> > before that? Also it would be nice to have a view for those blocks on
> > all the datanodes.
> >
> > It would be nice to do this debugging on IRC as it can require a lot
> > of back and forth.
> >
> > J-D
> >
> > On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <eran@.com> wrote:
> >> There wasn't an attachment; I pasted inline all the lines from all the
> >> NN logs that contain that particular block number.
> >>
> >> As for CPU/IO, first, there is nothing else running on those servers;
> >> second, CPU utilization on the slaves at peak load was around 40% and
> >> disk IO utilization less than 20%. That's the strange thing about it
> >> (I have another thread going about the performance): there is no
> >> bottleneck I could identify, and yet the performance was relatively
> >> low compared to the numbers I see quoted for HBase in other places.
> >>
> >> The first line of the NN log says:
> >> BLOCK* NameSystem.allocateBlock:
> >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194
> >> So it looks like the file name is:
> >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279
> >>
> >> Is there a better way to associate a file with a block?
> >>
> >> -eran
> >>
> >>
> >>
> >
>
