[ https://issues.apache.org/jira/browse/HADOOP-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540582 ]
stack commented on HADOOP-2075:
-------------------------------
The bulk uploader needs to tolerate a myriad of input data types. Data will
likely need massaging and, ultimately, sorting: writing HRegion content
directly into HDFS is preferred over going against the hbase API -- bulk
uploads against the hbase API will be dog slow -- but direct writes require
the data be sorted first. Using mapreduce, as sketched below, would make
sense.
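A minimal sketch of such a job, assuming the old org.apache.hadoop.mapred
API; the class names, the tab-separated (row key, cell value) input layout,
and the text output are all hypothetical -- a real tool would write HRegion
files in the reducer rather than plain text:

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BulkSortJob {

  // Emits (row key, cell value) so the shuffle delivers cells sorted by
  // row key -- the order HRegion content needs on disk.
  public static class SortMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text rowKey, Text cellValue,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Any per-record massaging of the raw input would happen here.
      out.collect(rowKey, cellValue);
    }
  }

  // Receives each row key's cells in sorted order; a real tool would
  // write them into HRegion files in HDFS here instead of text output.
  public static class SortReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text rowKey, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        out.collect(rowKey, values.next());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(BulkSortJob.class);
    conf.setJobName("hbase-bulk-sort");
    // Splits each input line at the first tab into key and value.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setMapperClass(SortMapper.class);
    conf.setReducerClass(SortReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
{code}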
Look too at using PIG because it has a few LOAD implementations -- from
files on local disk or HDFS -- and some facility for transforming data as
it moves tuples around. We would need to write a special STORE operator
that writes the data out sorted as HRegions directly into HDFS; a skeleton
follows. (This would be different from PIG-6, which is about writing into
hbase via the API.)
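A hypothetical skeleton of such an operator, assuming Pig's StoreFunc
interface of the era (bindTo/putNext/finish); the HRegionStorage name and
all of the region-writing details are placeholders, not a real
implementation:

{code:java}
import java.io.IOException;
import java.io.OutputStream;

import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class HRegionStorage implements StoreFunc {
  private OutputStream os;

  // Pig hands the operator the stream for the output partition.
  public void bindTo(OutputStream os) throws IOException {
    this.os = os;
  }

  // Called once per tuple; a real implementation would accumulate
  // sorted cells and write HRegion files rather than raw bytes.
  public void putNext(Tuple t) throws IOException {
    os.write(t.toString().getBytes());
    os.write('\n');
  }

  // Flush and close region files once the partition is complete.
  public void finish() throws IOException {
    os.flush();
  }
}
{code}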
Also, chatting with Jim: this is a pretty important issue. It is the first
one folks run into when they start to get serious about hbase.
> [hbase] Bulk load and dump tools
> --------------------------------
>
> Key: HADOOP-2075
> URL: https://issues.apache.org/jira/browse/HADOOP-2075
> Project: Hadoop
> Issue Type: New Feature
> Components: contrib/hbase
> Reporter: stack
> Priority: Minor
>
> Hbase needs tools to facilitate bulk upload and possibly dumping. Going via
> the current APIs, uploads can take a long time even when using many
> concurrent clients, particularly if the dataset is large and cell content
> is small. PNUTS folks talked of the need for a different API to manage bulk
> upload/dump. Another notion would be to have the bulk loader tools somehow
> write regions directly into hdfs.