Hbase pausing problems

Seraph Imalia Fri, 15 Jan 2010 01:34:20 -0800

Hi,

We are using ColdFusion as our server-side coding language, which is built on
Java.  We have written a Java class to simplify the ColdFusion coding by
providing simple classes to insert data into HBase.


Our HBase cluster is 3 servers...

1. each server has a hadoop datanode.
2. each server has an hbase regionserver.
3. each server has an instance of zookeeper.
4. server A is the hadoop namenode
5. server B is the master HBase server
6. server C has the secondary name node and is also ready to be started as a
master should server B master go down.
7. Each Java process has been given 1 Gig of RAM; each server has 8 Gigs of
RAM.
8. The servers are connected to each other through a 10/100 3Com Layer 3
Managed switch, and we are planning to put in a 10/100/1000 3Com Layer 3
Managed Switch to improve the speed of a memstore flush (among other things).

The problem...

Approximately every 10 minutes, our entire ColdFusion system pauses at the
point of inserting into HBase for between 30 and 60 seconds and then
continues.

Investigation...

Watching the logs of the regionservers, the pausing of the ColdFusion system
starts as soon as one of the regionservers starts flushing the memstore, and
it recovers as soon as that regionserver has finished flushing (that is, as
soon as it starts compacting).
I can recreate the error just by stopping 1 of the regionservers, but
starting that regionserver again does not make ColdFusion recover until I
restart the ColdFusion servers.  It is important to note that if I keep the
built-in HBase shell running, it is happily able to put and get data to and
from HBase whilst ColdFusion is busy pausing/failing.

I have tried increasing the regionserver's RAM to 3 Gigs and this just made
the problem worse because it took longer for the regionservers to flush the
memory store.  One of the links I found on your site mentioned increasing
the default value for hbase.regionserver.handler.count to 100; this did not
seem to make any difference.  I have double checked that the memory flush
very rarely happens on more than 1 regionserver at a time.  In fact, in my
many hours of staring at tails of logs, it only happened once that two
regionservers flushed at the same time.

My investigations point strongly towards a coding problem on our side rather
than a problem with the server setup or HBase itself.  I say this because
whilst I understand why a regionserver would go offline during a memory
flush, I would expect the other two regionservers to pick up the load,
especially since the built-in HBase shell has no problem accessing HBase
whilst a regionserver is busy doing a memstore flush.

So let me give you some insight into our Java code...

We have three main classes (the rest should not have much influence on
this)...

The first class (AdDeliveryData) provides simple functions to simplify the
ColdFusion code, the second (TableManagement) is used to communicate with
HBase, and the third (HBaseManager) just contains some simple functions to
create, drop and fetch tables.

AdDeliveryData's constructor looks like this...

    public AdDeliveryData(String hBaseConfigPath) throws IOException{
        _hbManager = new HBaseManager(hBaseConfigPath);
        
        _adDeliveryTable = new AdDeliveryTable();
        
        try {
            _adDeliveryManagement = _hbManager.getTable(_adDeliveryTable);
        } catch (TableNotFoundException e) {
            _adDeliveryManagement = _hbManager.createTable(_adDeliveryTable);
        }
    }

_hbManager, _adDeliveryTable and _adDeliveryManagement are private class
variables available to the whole class.

TableManagement's constructor looks like this...

    public TableManagement(HBaseConfiguration conf, TableDef table)
            throws IOException {
        _table = table;

        if (table.is_indexed()) {
            _itd = new IndexedTable(conf, Bytes.toBytes(table.get_name()));
        } else {
            _td = new HTable(conf, table.get_name());
        }
    }

_table, _itd and _td are protected variables available to the whole class.

HBaseManager's constructor looks like this...

    public HBaseManager(String configurationPath)
            throws MasterNotRunningException {
        Path confPath = new Path(configurationPath);
        hbConf = new HBaseConfiguration();
        hbConf.addResource(confPath);
        hbAdmin = new IndexedTableAdmin(hbConf);
    }

hbConf and hbAdmin are protected class variables available to the whole
class.

The constructor for AdDeliveryData only gets called once, when ColdFusion is
started, and it in turn runs the constructors for TableManagement and
HBaseManager.

The ColdFusion variable that gets stored in the Application scope is called
Application.objAdDeliveryData.  Every time ColdFusion needs to insert data,
it calls Application.objAdDeliveryData.insertAdImpressionData, which calls
_adDeliveryManagement.insertOrUpdateRow, which in turn builds an ArrayList
of Puts and runs _td.put(putList).
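
To make the shape of that call chain easier to follow, here is a simplified
sketch of it.  It is not our real code: Row and FakeTable are hypothetical
stand-ins for HBase's Put and the table handle (_td), the method bodies and
parameters are assumptions, and nothing here talks to HBase.  It only shows
that one long-lived object graph is built at startup and then reused for
every insert.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InsertPathSketch {
    /** Hypothetical stand-in for an HBase Put: a row key plus one value. */
    static final class Row {
        final String key;
        final String value;
        Row(String key, String value) { this.key = key; this.value = value; }
    }

    /** Hypothetical stand-in for the table handle (_td); records writes. */
    static final class FakeTable {
        final List<Row> written = new ArrayList<>();
        void put(List<Row> rows) { written.addAll(rows); }
    }

    /** Mirrors TableManagement: holds one table handle for its lifetime. */
    static final class TableManagement {
        private final FakeTable td;
        TableManagement(FakeTable td) { this.td = td; }

        void insertOrUpdateRow(String key, List<String> values) {
            List<Row> putList = new ArrayList<>();
            for (String v : values) {
                putList.add(new Row(key, v));
            }
            td.put(putList); // one batched call per insert request
        }
    }

    /** Mirrors AdDeliveryData: built once, then reused for every request. */
    static final class AdDeliveryData {
        private final TableManagement adDeliveryManagement;
        AdDeliveryData(FakeTable table) {
            this.adDeliveryManagement = new TableManagement(table);
        }

        void insertAdImpressionData(String key, List<String> values) {
            adDeliveryManagement.insertOrUpdateRow(key, values);
        }
    }

    public static void main(String[] args) {
        FakeTable table = new FakeTable();
        // Equivalent of Application.objAdDeliveryData, created at startup.
        AdDeliveryData objAdDeliveryData = new AdDeliveryData(table);
        objAdDeliveryData.insertAdImpressionData("ad-1", Arrays.asList("a", "b"));
        objAdDeliveryData.insertAdImpressionData("ad-2", Arrays.asList("c"));
        System.out.println("rows written: " + table.written.size());
    }
}
```

The point of the sketch is that every request goes through the same table
handle that was created when ColdFusion started.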

I think either I am leaving out code that is required to determine which
RegionServers are available OR I am keeping too many hBase objects in RAM
instead of calling their constructors each time (my purpose obviously was to
improve performance).

Currently the live system is inserting over 7 Million records per day
(mostly between 8AM and 10PM) which is not a ridiculously high load.
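
To put a rough number on that load: the back-of-the-envelope calculation
below assumes exactly 7 million records spread evenly over the 14 hours
between 8AM and 10PM.  Both are simplifications (the real figure is "over"
7 million and only "mostly" inside that window), so treat the result as an
estimate of about 139 inserts per second on average, not a measurement.

```java
public class InsertRate {
    /** Average inserts per second if recordsPerDay arrive evenly over activeHours. */
    static double perSecond(long recordsPerDay, int activeHours) {
        return recordsPerDay / (activeHours * 3600.0);
    }

    public static void main(String[] args) {
        // 8AM to 10PM is a 14-hour window (assumed evenly loaded).
        double rate = perSecond(7_000_000L, 14);
        System.out.printf("average inserts/sec: %.1f%n", rate);
        // Rows that would be held up by one pause at that average rate.
        System.out.printf("rows delayed by a 30s pause: %.0f%n", rate * 30);
        System.out.printf("rows delayed by a 60s pause: %.0f%n", rate * 60);
    }
}
```

At that average rate, each 30 to 60 second pause holds up several thousand
inserts.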

Any input will be incredibly helpful.  I have a test system up and running
and I am trying to re-create the scenario there so that I am not working on
a live environment; beyond that, basically all I can do is trial and error.

Please assist?

Regards,
Seraph
