On Tue, Jan 19, 2010 at 12:13 AM, stack <[email protected]> wrote:
> On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
> [email protected]> wrote:
>
>> Now we are trying to import 50 million rows of data. Each row has
>> 100 columns (in reality we will have a sparsely populated table, but
>> for now we are testing the worst-case scenario). The 50 million
>> records are encoded in about 100 CSV files stored in HDFS.
>>
>
>
> 50 million rows for such a cluster is a small number. 100 columns per
> row should work out fine. You have one column family only, right?
>
Yes. For now we are doing a proof of concept and used one family for
everything. In reality we will have ~10 families for the 100 columns.
>
>
>>
>> The importing process is a really simple one: a small MapReduce
>> program reads a CSV file, splits the lines and inserts them into the
>> table (Map only, no Reduce part). We are using the default Hadoop
>> configuration (on 7 nodes we can run 14 maps). We are also using a
>> 32MB writeBufferSize on the HBase client and have set setWriteToWAL
>> to false.
>>
>>
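For reference, the map task described above boils down to roughly the
following (a simplified sketch, not our exact code; the class, family
and column names here are made up):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvImportMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  // single column family for the proof-of-concept table
  private static final byte[] FAMILY = Bytes.toBytes("data");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    // first CSV field is the row key, the remaining ~100 fields become columns
    Put put = new Put(Bytes.toBytes(fields[0]));
    put.setWriteToWAL(false); // we skip the WAL during the bulk import
    for (int i = 1; i < fields.length; i++) {
      put.add(FAMILY, Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
    }
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}
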
> The MapReduce tasks are running on the same nodes as the HBase and
> datanode processes? With only 8G of RAM, that might be a bit of a
> stretch. Do you have monitoring on these machines? Any swapping? Or
> are they fine?
>
>
No, there is no swapping at all. Also, CPU usage is really low.
>
>
>> At the beginning everything looks fine, but after ~33 million
>> records we encounter strange behavior from HBase.
>>
>> Firstly, the node where the META table resides has a high load. The
>> status web page shows ~1700 requests on that node even when we are
>> not running any MapReduce job (0 requests on the other nodes).
>
>
> See other message.
>
> Are you inserting only one row per map task or more than that? Are
> you reusing an HTable instance? Or, failing that, passing the same
> HBaseConfiguration each time? If you make a new HTable with a new
> HBaseConfiguration each time then it does not make use of the cache
> of region locations; it has to go fetch them again. This can make for
> extra loading on the .META. table.
>
We have 500,000 lines per CSV file (~518MB each). Default splitting is
used. We are using a slightly modified TableOutputFormat class (I added
support for setting the write buffer size). So we instantiate
HBaseConfiguration only in the main method and leave the rest to the
(Custom)TableOutputFormat.
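The driver looks more or less like this (again only a sketch:
CustomTableOutputFormat stands in for my modified class, and the
"custom.write.buffer" property name is just illustrative of how the
buffer size gets passed):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CsvImportJob {
  public static void main(String[] args) throws Exception {
    // one HBaseConfiguration, created once here; the output format builds
    // its HTable from this same configuration in each task
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.set("hbase.mapred.outputtable", "poc_table");       // output table name
    conf.setLong("custom.write.buffer", 32L * 1024 * 1024);  // 32MB write buffer, read by our output format (illustrative key)

    Job job = new Job(conf, "csv-import");
    job.setJarByClass(CsvImportJob.class);
    job.setMapperClass(CsvImportMapper.class);
    job.setNumReduceTasks(0);                                // Map only, no Reduce
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(CustomTableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}
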
> Regarding logs, enable DEBUG if you can (see the FAQ for how).
>
Will provide logs soon ...
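I'll turn DEBUG on the way the FAQ describes, i.e. something like the
following in conf/log4j.properties on the region servers (assuming the
stock log4j setup), then restart:

  log4j.logger.org.apache.hadoop.hbase=DEBUG
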
>
>> The second manifestation is that I can create a new empty table and
>> start importing data normally, but if I try to import more data into
>> the same table (now holding ~33 million rows) I get really bad
>> performance and the HBase status page does not work at all (it will
>> not load in the browser).
>>
> That's bad. Can you tell how many regions you have on your cluster?
> How many per server?
>
~1800 regions on the cluster and ~250 per node. We are using a
replication factor of 2 (there is no particular reason why we used 2
instead of the default 3).
Also, if I leave the maps running I get the following errors in the
datanode logs:
2010-01-18 23:15:15,795 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.177.88.209:50010, storageID=DS-515966566-10.177.88.209-50010-1263597214826, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_3350193476599136386_135159 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)
>
>
>> So my question is: what am I doing wrong? Is the current cluster
>> good enough to support 50 million records, or is my current 33
>> million the limit on this configuration? Also, I'm getting about 800
>> inserts per second; is this slow? Any hint is appreciated.
>>
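(For scale, the arithmetic behind that number: 50,000,000 rows / 800
rows per second = 62,500 seconds, i.e. roughly 17 hours for the full
import.)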
> An insert has 100 columns? Is this 800/second across the whole cluster?
>
> St.Ack
>