How many CPUs?

You are running with the default JVM settings (see HBASE_OPTS in
hbase-env.sh).  You might want to enable GC logging; there is a commented-out
line for it just below HBASE_OPTS in hbase-env.sh.  Enable it.  GC logging
might tell you about the pauses you are seeing.
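One way this typically looks in hbase-env.sh (the exact flag set and log path
here are examples, not the shipped defaults; adjust to your install):

```shell
# In conf/hbase-env.sh -- a typical GC-logging setup.
# Flags and log path are examples; adjust to your install.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
```

Long lines in the resulting gc-hbase.log are what you then correlate against
the client-side pauses.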

Can you get a fourth server for your cluster?  Run the master, zookeeper, and
namenode on it, and leave the other three servers to the regionservers and
datanodes (perhaps with replication == 2, as per J-D, to lighten the load on
a small cluster).
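On the HDFS side, replication == 2 would be set along these lines (the
property name is standard HDFS; the fragment is a sketch):

```xml
<!-- In conf/hdfs-site.xml: keep two copies of each block instead of the
     default three, to lighten write load on a three-node cluster. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```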

More notes inline below.

On Fri, Jan 15, 2010 at 1:33 AM, Seraph Imalia <ser...@eisp.co.za> wrote:

> Approximately every 10 minutes, our entire coldfusion system pauses at the
> point of inserting into hBase for between 30 and 60 seconds and then
> continues.
>
Yeah, enable GC logging.  See if you can correlate the pauses the client is
seeing with GC pauses.




> Investigation...
>
> Watching the logs of the regionserver, the pausing of the coldfusion system
> happens as soon as one of the regionservers starts flushing the memstore
> and
> recovers again as soon as it is finished flushing (recovers as soon as it
> starts compacting).
>


...though, this would seem to point to an issue with your hardware.  How
many disks?  Are they misconfigured such that they hold up the system when
they are being heavily written to?


A regionserver log at DEBUG from around this time so we could look at it
would be helpful.


> I can recreate the error just by stopping 1 of the regionservers; but then
> starting the regionserver again does not make coldfusion recover until I
> restart the coldfusion servers.  It is important to note that if I keep the
> built in hBase shell running, it is happily able to put and get data to and
> from hBase whilst coldfusion is busy pausing/failing.
>

This seems odd.  Enable DEBUG on the client side.  Do you see the shell
recalibrating, finding new locations for regions after you shut down the
single regionserver, something your coldfusion client is not doing?  Or
maybe the shell is putting to a regionserver that was not disturbed by your
stop/start?
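Client-side DEBUG is usually a one-line change in the client's
log4j.properties (the logger name below is the standard hbase package):

```
# In the client's log4j.properties: DEBUG for all hbase classes.
log4j.logger.org.apache.hadoop.hbase=DEBUG
```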


>
> I have tried increasing the regionserver's RAM to 3 Gigs and this just made
> the problem worse because it took longer for the regionservers to flush the
> memory store.


Again, if flushing is holding up the machine, if you can't write a file in
the background without it freezing the box, then your machines are either
anemic or misconfigured.


> One of the links I found on your site mentioned increasing
> the default value for hbase.regionserver.handler.count to 100 -- this did
> not
> seem to make any difference.


Leave this configuration in place, I'd say.

Are you seeing 'blocking' messages in the regionserver logs?  A regionserver
will stop taking writes if it thinks it is being overrun, to keep itself from
OOME'ing.  Grep for the 'multiplier' configuration in hbase-default.xml.
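The setting meant here is hbase.hregion.memstore.block.multiplier; it lives
in hbase-default.xml and is overridden in hbase-site.xml.  A sketch of
raising it (the value shown is an example, not a recommendation):

```xml
<!-- In conf/hbase-site.xml: a region blocks updates once its memstore
     reaches multiplier * flush size.  Raising it trades memory for fewer
     write blocks; it masks slow flushes rather than fixing them. -->
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
```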



> I have double checked that the memory flush
> very rarely happens on more than 1 regionserver at a time -- in fact in my
> many hours of staring at tails of logs, it only happened once where two
> regionservers flushed at the same time.
>
You've enabled DEBUG?



> My investigations point strongly towards a coding problem on our side
> rather
> than a problem with the server setup or hBase itself.


If things were slow from the client's perspective, that might be a
client-side coding problem, but these pauses, unless you have a fly-by
deadlock in your client code, are probably an hbase issue.



>  I say this because
> whilst I understand why a regionserver would go offline during a memory
> flush, I would expect the other two regionservers to pick up the load --
> especially since the built-in hbase shell has no problem accessing hBase
> whilst a regionserver is busy doing a memstore flush.
>
HBase does not go offline during a memstore flush.  It continues to be
available for reads and writes during this time.  And see J-D's response
correcting the understanding of how regions are distributed across an hbase
cluster.



...


> I think either I am leaving out code that is required to determine which
> RegionServers are available OR I am keeping too many hBase objects in RAM
> instead of calling their constructors each time (my purpose obviously was
> to
> improve performance).
>
>
For sure keep a single HBaseConfiguration instance at least, and use it when
constructing all HTable and HBaseAdmin instances.
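A minimal sketch of that against the 0.20-era client API; the class, table,
family, and qualifier names are made up for illustration, and it needs a
running cluster plus the hbase client jars on the classpath:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientHolder {
  // One configuration for the whole client JVM; it carries the shared
  // connection state and region-location cache, so new HTable instances
  // built from it do not redo that work.
  private static final HBaseConfiguration CONF = new HBaseConfiguration();

  public static void putExample() throws Exception {
    // HTable instances sharing CONF are cheap to create, but HTable is
    // not thread-safe: make one per thread, not one per JVM.
    HTable table = new HTable(CONF, "mytable");
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("family"), Bytes.toBytes("qualifier"),
        Bytes.toBytes("value"));
    table.put(p);
  }
}
```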



> Currently the live system is inserting over 7 Million records per day
> (mostly between 8AM and 10PM) which is not a ridiculously high load.
>
>
What size are the records?   What is your table schema?  How many regions do
you currently have in your table?

 St.Ack
