On Tue, Jan 19, 2010 at 12:13 AM, stack <[email protected]> wrote:
> On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
> [email protected]> wrote:
>
>> Now we are trying to import 50 million rows of data. Each row has
>> 100 columns (in reality we will have a sparsely populated table, but
>> for now we are testing the worst-case scenario). We have 50 million
>> records encoded in about 100 CSV files stored in HDFS.
>>
>
>
> 50 million rows for such a cluster is a small number.  100 columns per
> row should work out fine.  You have one column family only, right?
>

Yes. For now we are doing a proof of concept and we used one family for
everything. In reality, we will have ~10 families for the 100 columns.
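
Something along these lines for the production schema (just a sketch to
show the split; "testtable" and the "fN" family names are placeholders,
not our real names):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateSchema {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTableDescriptor desc = new HTableDescriptor("testtable");
    // ~10 families, each grouping roughly 10 of the 100 columns
    for (int i = 0; i < 10; i++) {
      desc.addFamily(new HColumnDescriptor("f" + i));
    }
    new HBaseAdmin(conf).createTable(desc);
  }
}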

>
>
>>
>> The importing process is a really simple one: a small MapReduce
>> program reads a CSV file, splits the lines, and inserts them into the
>> table (Map only, no Reduce part). We are using the default Hadoop
>> configuration (on 7 nodes we can run 14 maps). We are also using 32MB
>> for writeBufferSize on HBase, and we set setWriteToWAL to false.
>>
>>
> The MapReduce tasks are running on the same nodes as hbase+datanodes?  With
> 8G of RAM only, that might be a bit of a stretch.  You have monitoring on
> these machines?  Any swapping?  Or are they fine?
>
>

No, there is no swapping at all. Also, CPU usage is really low.

>
>
>> At the beginning everything looks fine, but after ~33 million
>> records we encounter strange behavior from HBase.
>>
>> Firstly, the node where the META table resides has a high load. The
>> status web page shows ~1700 requests on that node even when we are not
>> running any MapReduce (0 requests on the other nodes).
>
>
> See other message.
>
> Are you inserting only one row per map task, or more than this?  Are you
> reusing an HTable instance?  Or, failing that, passing the same
> HBaseConfiguration each time?  If you make a new HTable with a new
> HBaseConfiguration each time then it does not make use of the cache of
> region locations; it has to go fetch them again.  This can make for extra
> loading on the .META. table.
>

We have 500000 lines per CSV file (~518MB each). The default input
splitting is used. We are using a slightly modified TableOutputFormat
class (I added support for setting the write buffer size).

So we are instantiating HBaseConfiguration only in the main method and
leaving the rest to the (custom) TableOutputFormat.
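
Roughly, the write path per map task looks like the sketch below. This
is only a simplified illustration, not the actual modified
TableOutputFormat; the class name CsvImportMapper, the table name
"testtable" and the family "d" are made-up placeholders. It just shows
the idea: one HTable per task, built from the job configuration, with a
32MB write buffer and the WAL turned off.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: map-only CSV import writing straight to HBase.
public class CsvImportMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per map task, built from the job configuration, so the
    // region location cache is reused instead of hitting .META. per row.
    table = new HTable(
        new HBaseConfiguration(context.getConfiguration()), "testtable");
    table.setAutoFlush(false);
    table.setWriteBufferSize(32 * 1024 * 1024);  // 32MB client-side buffer
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException {
    String[] fields = line.toString().split(",");
    Put put = new Put(Bytes.toBytes(fields[0]));  // first field = row key
    put.setWriteToWAL(false);                     // WAL off, as in our test
    for (int i = 1; i < fields.length; i++) {
      put.add(Bytes.toBytes("d"), Bytes.toBytes("c" + i),
          Bytes.toBytes(fields[i]));
    }
    table.put(put);  // buffered client-side until the 32MB buffer fills
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();  // push whatever is still sitting in the buffer
  }
}

The real code differs in that the HTable lives inside the custom
TableOutputFormat's RecordWriter, but the effect should be the same.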

> Regarding logs, enable DEBUG if you can (see the FAQ for how).
>

Will provide logs soon ...
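
For reference, my understanding is that DEBUG is enabled by setting the
HBase logger to DEBUG in conf/log4j.properties on every node and then
restarting, roughly:

  # ${HBASE_HOME}/conf/log4j.properties
  log4j.logger.org.apache.hadoop.hbase=DEBUG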

>
>> The second manifestation is that I can create a new empty table and
>> start importing data normally, but if I try to import more data into
>> the same table (now holding ~33 million rows) I get really bad
>> performance and the HBase status page does not work at all (it will
>> not load in the browser).
>>
> That's bad.  Can you tell how many regions you have on your cluster?  How
> many per server?
>

~1800 regions on the cluster and ~250 per node. We are using a
replication factor of 2 (there is no particular reason why we used 2
instead of the default 3).

Also, if I leave the maps running I get the following errors in the datanode logs:

2010-01-18 23:15:15,795 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.177.88.209:50010,
storageID=DS-515966566-10.177.88.209-50010-1263597214826,
infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_3350193476599136386_135159 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)


>
>
>> So my question is: what am I doing wrong? Is the current cluster good
>> enough to support 50 million records, or is my current 33 million the
>> limit on the current configuration? Also, I'm getting about 800
>> inserts per second; is this slow?  Any hint is appreciated.
>>
> An insert has 100 columns?  Is this 800/second across the whole cluster?
>
> St.Ack
>
