[ https://issues.apache.org/jira/browse/HBASE-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255995#comment-13255995 ]

Clint Heath commented on HBASE-5741:
------------------------------------

Lars et al., thank you for your questions and comments.  I may be missing 
something, so please correct me if I'm wrong, but here's how I see the 
situation:

1) importtsv will auto-split the input data into region-sized HFiles based on 
hbase.hregion.max.filesize (because we use HFileOutputFormat, which reads that 
parameter from the client configs).

2) completebulkload then distributes the resulting HFiles to the region 
servers, thereby populating your cluster with all the regions needed to host 
your data.  This avoids all the memstore flushes, compactions, splits, and 
other overhead associated with running billions of individual put()s to load 
your data incrementally, which is why we recommend the bulk load process for 
initially populating HBase tables.

3) pre-splitting your table (which can only be expressed in terms of rowkeys) 
is useless in this scenario, because it does not take into account the amount 
of data associated with each key.  The customer would literally have to walk 
through their data and determine how many gigs fall under each rowkey range in 
order to pre-split their table for a bulk load...and even then, HBase only 
takes those pre-splits as "suggestions": if the data ends up needing to be 
split differently, that happens automatically.  So what purpose does 
pre-splitting serve in a bulk load?  Think "initial load" of massive amounts 
of data.  (There's a sketch of what pre-splitting looks like in the client API 
right after this list.)

4) Lars had a good point about compression: yes, a customer typically *should* 
set up their table with compression first, but they usually don't know this, 
and they can enable it after the fact without consequence.
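
For concreteness, here's roughly what pre-creating a table with split points 
and compression looks like against the 0.94-era client API.  This is just an 
illustrative sketch, not anything from a patch: the table name, column family, 
and split keys below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("myTable");  // placeholder table name
    HColumnDescriptor family = new HColumnDescriptor("d");    // placeholder column family
    family.setCompressionType(Compression.Algorithm.GZ);      // point 4: compression set up front
    desc.addFamily(family);

    // Point 3: pre-splits are expressed purely as row keys; nothing here knows
    // how many bytes of data will actually land between any two of these keys.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("row-0333333"),
        Bytes.toBytes("row-0666666"),
    };
    admin.createTable(desc, splitKeys);
    admin.close();
  }
}

Note how the split points say nothing about data volume, which is exactly why 
they are only a guess when bulk loading a dataset of unknown size.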

In conclusion, we provide this handy command-line tool to help our HBase 
customers (most of whom are complete newbies and don't know what they're doing) 
get their HBase tables up and running, yet we give them conflicting information 
about how to use it.  If we need to tell them to pre-create the table, then we 
should tell them so...clearly.  However, I think it is simpler and more 
intuitive for the tool to do that automatically, as the javadocs indicate it 
will.  Since region splits are determined internally by the HFileOutputFormat 
anyway (when the bulkload option is used), what's the harm?  This lets people 
get their data loaded into HBase with the least amount of education, ramp-up, 
Java development, data mining, etc.
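
To make the ask concrete, here is a rough sketch of the kind of guard 
createSubmittableJob could add before constructing the HTable, written against 
the 0.94-era client API.  The column family is a placeholder (a real fix would 
derive the families from the importtsv.columns specification), so read it as 
an illustration rather than the actual patch:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;

// ...inside createSubmittableJob, where conf and tableName are already in scope:
HBaseAdmin admin = new HBaseAdmin(conf);
if (!admin.tableExists(tableName)) {
  HTableDescriptor desc = new HTableDescriptor(tableName);
  desc.addFamily(new HColumnDescriptor("d"));  // placeholder family; real ones
                                               // would come from importtsv.columns
  admin.createTable(desc);
}
HTable table = new HTable(conf, tableName);    // no longer throws TableNotFoundException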
                
> ImportTsv does not check for table existence 
> ---------------------------------------------
>
>                 Key: HBASE-5741
>                 URL: https://issues.apache.org/jira/browse/HBASE-5741
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce
>    Affects Versions: 0.90.4
>            Reporter: Clint Heath
>            Assignee: Himanshu Vashishtha
>             Fix For: 0.96.0, 0.94.1
>
>         Attachments: 5741-94.txt, 5741-v3.txt, HBase-5741-v2.patch, 
> HBase-5741.patch
>
>
> The usage statement for the "importtsv" command in HBase claims this:
> "Note: if you do not use this option, then the target table must already 
> exist in HBase" (in reference to the "importtsv.bulk.output" command-line 
> option)
> The truth is, the table must exist no matter what; importtsv cannot and will 
> not create it for you.
> This is the case because the createSubmittableJob method of ImportTsv does 
> not even attempt to check if the table exists already, much less create it:
> (From org.apache.hadoop.hbase.mapreduce.ImportTsv.java)
> 305 HTable table = new HTable(conf, tableName);
> The HTable constructor used there assumes the table exists and runs a meta 
> scan on it:
> (From org.apache.hadoop.hbase.client.HTable.java)
> 142 * Creates an object to access a HBase table.
> ...
> 151 public HTable(Configuration conf, final String tableName)
> What we should do inside createSubmittableJob is something similar to what 
> the "completebulkload" command does:
> (Taken from org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.java)
> 690 boolean tableExists = this.doesTableExist(tableName);
> 691 if (!tableExists) this.createTable(tableName,dirPath);
> Currently the docs are misleading; the table in fact must exist prior to 
> running importtsv.  We should check whether it exists rather than assume it's 
> already there and throw the exception below:
> 12/03/14 17:15:42 WARN client.HConnectionManager$HConnectionImplementation: 
> Encountered problems when prefetch META table: 
> org.apache.hadoop.hbase.TableNotFoundException: Cannot find row in .META. for 
> table: myTable2, row=myTable2,,99999999999999
>       at 
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
> ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
