[ https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikhail Antonov updated HBASE-8073:
-----------------------------------
    Fix Version/s:     (was: 1.3.0)
                   1.4.0

> HFileOutputFormat support for offline operation
> -----------------------------------------------
>
>                 Key: HBASE-8073
>                 URL: https://issues.apache.org/jira/browse/HBASE-8073
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Nick Dimiduk
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-8073-trunk-v0.patch, HBASE-8073-trunk-v1.patch
>
>
> When using HFileOutputFormat to generate HFiles, it inspects the region
> topology of the target table. The split points from that table are used to
> guide the TotalOrderPartitioner. If the target table does not exist, it is
> first created. This imposes an unnecessary dependence on an online HBase
> cluster and an existing table.
> If the table exists, it can be used. However, the job can be smarter. For
> example, if there's far more data going into the HFiles than the table
> currently contains, the table regions aren't very useful as data split
> points. Instead, the input data can be sampled to produce split points more
> meaningful to the dataset. LoadIncrementalHFiles is already capable of
> handling divergence between HFile boundaries and table regions, so this
> should not pose any additional burden at load time.
> The proper method of sampling the data likely requires a custom input format
> and an additional map-reduce job to perform the sampling. See a relevant
> implementation:
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
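
To make the sampling idea in the description concrete, here is a minimal in-memory sketch of reservoir sampling followed by split-point selection. It is not the patch attached to this issue and not the linked ReservoirSamplerInputFormat. The class SplitPointSampler and its methods are hypothetical names, row keys are modeled as Strings rather than the byte[] keys HBase uses, and a real job would do the sampling in a separate map-reduce pass as the description suggests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration only; not part of HBase or Hadoop. */
final class SplitPointSampler {

    /**
     * Reservoir sampling (Algorithm R): keeps a uniform random sample of at
     * most k keys from a stream whose length is not known in advance.
     */
    static List<String> reservoirSample(Iterator<String> keys, int k, Random rng) {
        List<String> reservoir = new ArrayList<>();
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k keys.
                reservoir.add(key);
            } else {
                // Keep the new key with probability k / seen by drawing a
                // slot in [0, seen) and replacing only if it falls below k.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, key);
                }
            }
        }
        return reservoir;
    }

    /**
     * Sorts the sample and picks up to numPartitions - 1 evenly spaced keys
     * as partition boundaries. Duplicates are skipped so the boundaries stay
     * strictly increasing.
     */
    static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            String candidate = sorted.get(idx);
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Synthetic row keys standing in for the job's input data.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            keys.add(String.format("row-%05d", i));
        }
        List<String> sample = reservoirSample(keys.iterator(), 1_000, new Random(42));
        System.out.println(splitPoints(sample, 4));
    }
}
```

Boundaries chosen this way reflect the distribution of the incoming data rather than the current region layout, which is the property the description asks for. Whether String ordering matches the job's key comparator is an assumption to check before feeding such boundaries to a partitioner.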