[https://issues.apache.org/jira/browse/HBASE-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757186#action_12757186]
Jonathan Gray commented on HBASE-48:
------------------------------------
The MR job is working tremendously well for me. I'm able to almost instantly
saturate my entire cluster during an upload, and it remains saturated until the
end: full CPU usage and lots of io-wait, so I'm disk io-bound as I should be.
I did a few runs of a job which imported between 1M and 10M rows, each row
containing a random number of columns from 1 to 1000. In the end, I imported
between 500M and 5B KeyValues.
On a 5 node cluster of 2core/2gb/250gb nodes, I could import 1M rows / 500M
keys in 7.5 minutes (2.2k rows/sec, 1.1M keys/sec).
On a 10 node cluster of 4core/4gb/500gb nodes, I could do the same import in
2.5 minutes. On this larger cluster I also ran the same job but with 10M rows
/ 5B keys in 25 minutes (6.6k rows/sec, 3.3M keys/sec).
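For reference, the per-second figures quoted above follow directly from the row/key counts and run times; a quick sanity check (a sketch in Python, using only the numbers stated in this comment):

```python
# Sanity-check the throughput figures quoted above.
# Each tuple: (label, rows imported, KeyValues imported, duration in minutes),
# all taken from the runs described in this comment.
runs = [
    ("5-node cluster,  1M rows", 1_000_000, 500_000_000, 7.5),
    ("10-node cluster, 1M rows", 1_000_000, 500_000_000, 2.5),
    ("10-node cluster, 10M rows", 10_000_000, 5_000_000_000, 25.0),
]

for label, rows, keys, minutes in runs:
    secs = minutes * 60
    print(f"{label}: {rows / secs:,.0f} rows/sec, {keys / secs / 1e6:.1f}M keys/sec")
```

The first and third runs reproduce the quoted 2.2k rows/sec / 1.1M keys/sec and 6.6k rows/sec / 3.3M keys/sec figures (to rounding).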
Previously running HTable-based imports on these clusters, I was seeing between
100k and 200k keys/sec, so this represents a 5-15X speed improvement. In
addition, the imports finish without any problem (running these imports through
HBase would have killed the little cluster).
I think there is a bug with the ruby script, though. It worked sometimes, but
other times it ended up hosing the cluster until I restarted it; things worked
fine after the restart.
Still digging...
> [hbase] Bulk load tools
> -----------------------
>
> Key: HBASE-48
> URL: https://issues.apache.org/jira/browse/HBASE-48
> Project: Hadoop HBase
> Issue Type: New Feature
> Reporter: stack
> Priority: Minor
> Attachments: 48-v2.patch, 48-v3.patch, 48-v4.patch, 48-v5.patch,
> 48.patch, loadtable.rb
>
>
> Hbase needs tools to facilitate bulk upload and possibly dumping. Going via
> the current APIs, particularly if the dataset is large and cell content is
> small, uploads can take a long time even when using many concurrent clients.
> PNUTS folks talked of need for a different API to manage bulk upload/dump.
> Another notion would be to somehow have the bulk load tools write
> regions directly in HDFS.
--
This message is automatically generated by JIRA.