On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
[email protected]> wrote:

> Now we are trying to import 50 million rows of data. Each row has
> 100 columns (in reality we will have a sparsely populated table, but
> now we are testing the worst-case scenario). We have 50 million
> records encoded in about 100 CSV files stored in HDFS.
>


50 million rows is a small number for such a cluster. 100 columns per
row should work out fine. You have one column family only, right?



>
> The importing process is a really simple one: a small MapReduce
> program reads the CSV files, splits the lines, and inserts them into
> the table (Map only, no Reduce part). We are using the default Hadoop
> configuration (on 7 nodes we can run 14 maps). We are also using 32MB
> for writeBufferSize on HBase, and we set setWriteToWAL to false.
>
>
The mapreduce tasks are running on the same nodes as hbase+datanodes? With
only 8G of RAM, that might be a bit of a stretch. Do you have monitoring on
these machines? Any swapping? Or are they fine?




> At the beginning everything looks fine, but after ~33 million records
> we encounter strange behavior from HBase.
>
> Firstly, one of the nodes where the META table resides has a high
> load. The status web page shows ~1700 requests on that node even when
> we are not running any MapReduce (0 requests on the other nodes).


See other message.

Are you inserting only one row per map task or more than that? Are you
reusing an HTable instance? Or, failing that, passing the same
HBaseConfiguration each time? If you make a new HTable with a new
HBaseConfiguration each time, then it does not make use of the cache of
region locations; it has to go fetch them again. This can make for extra
load on the .META. table.
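
For illustration, here is a rough sketch of a mapper that creates the
HTable once in setup() and reuses it for every record. The class name,
table name "mytable", column family "cf", and the assumption that the
first CSV field is the row key are all made up; adjust to your schema:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvImportMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per task, not per record; a long-lived client caches
    // region locations instead of re-fetching them from .META. on
    // every insert.
    HBaseConfiguration conf = new HBaseConfiguration();
    table = new HTable(conf, "mytable");         // assumed table name
    table.setAutoFlush(false);                   // buffer puts client-side
    table.setWriteBufferSize(32 * 1024 * 1024);  // the 32MB you mention
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException {
    // Naive CSV split; assumes no quoted commas in the data.
    String[] fields = line.toString().split(",");
    Put put = new Put(Bytes.toBytes(fields[0]));  // first field as row key
    for (int i = 1; i < fields.length; i++) {
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("c" + i),
          Bytes.toBytes(fields[i]));
    }
    put.setWriteToWAL(false);  // as in your setup
    table.put(put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // push whatever is left in the write buffer
    table.close();
  }
}

The important part is just that the HTable and its configuration outlive
a single record; everything else is incidental.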

Regarding logs, enable DEBUG if you can (see the FAQ for how).
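
If the FAQ is not handy: assuming the stock log4j setup, adding a line
like the following to conf/log4j.properties on each node (and restarting)
should do it:

  log4j.logger.org.apache.hadoop.hbase=DEBUG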


> The second manifestation is that I can create a new empty table and
> start importing data normally, but if I try to import more data into
> the same table (now holding ~33 million rows) I get really bad
> performance, and the HBase status page does not work at all (it will
> not load in the browser).
>

That's bad. Can you tell how many regions you have on your cluster? How
many per server?
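(As a rough check, the master web UI, on port 60010 by default if I
remember right, lists the region count per regionserver.)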



> So my question is: what am I doing wrong? Is the current cluster good
> enough to support 50 million records, or is my current 33 million the
> limit on this configuration? Also, I'm getting about 800 inserts per
> second; is this slow? Any hint is appreciated.
>

An insert has 100 columns? Is this 800/second across the whole cluster?
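
If it is, note that 800 rows/second with 100 columns each works out to
80,000 cells/second, and the full 50 million rows would take
50,000,000 / 800 = 62,500 seconds, roughly 17 hours. So it matters a lot
whether that figure is per task or the aggregate across the cluster.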

St.Ack
