[ https://issues.apache.org/jira/browse/HBASE-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Densel Santhmayor updated HBASE-18161:
--------------------------------------
    Attachment: MultiHFileOutputFormatSupport_HBASE_18161_v11.patch

> Incremental Load support for Multiple-Table HFileOutputFormat
> -------------------------------------------------------------
>
>                 Key: HBASE-18161
>                 URL: https://issues.apache.org/jira/browse/HBASE-18161
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Densel Santhmayor
>            Priority: Minor
>         Attachments: MultiHFileOutputFormatSupport_HBASE_18161.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v10.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v11.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v2.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v3.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v4.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v5.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v6.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v7.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v8.patch,
> MultiHFileOutputFormatSupport_HBASE_18161_v9.patch
>
>
> h2. Introduction
> MapReduce currently supports writing HBase records in bulk to HFiles for a
> single table. The file(s) can then be uploaded to the relevant RegionServers
> with reasonable latency. This feature is useful for making a large set of
> data available for queries at the same time, and it provides a way to
> process very large input into HBase efficiently without affecting query
> latencies.
> There is, however, no support for writing variations of the same record key
> to HFiles belonging to multiple HBase tables from within the same MapReduce
> job.
>
> h2. Goal
> The goal of this JIRA is to extend HFileOutputFormat2 to support writing to
> HFiles for different tables within the same MapReduce job while keeping
> single-table HFile features backwards-compatible.
> For our use case, we needed to write a record key to a smaller HBase table
> for quicker access, and the same record key with a date appended to a larger
> table for longer term storage with chronological access. Each of these tables
> would have different TTL and other settings to support their respective
> access patterns. We also needed to be able to bulk write records to multiple
> tables with different subsets of very large input as efficiently as possible.
> Rather than run the MapReduce job multiple times (once for each table or
> record structure), it would be useful to be able to parse the input a single
> time and write to multiple tables simultaneously.
> Additionally, we'd like to maintain backwards compatibility with the existing
> heavily-used HFileOutputFormat2 interface, so that benefits such as locality
> sensitivity (which was introduced long after we implemented support for
> multiple tables) apply to both single-table and multi-table HFile writes.
>
> h2. Proposal
> * Backwards compatibility for existing single table support in
> HFileOutputFormat2 will be maintained, and in this case mappers will need to
> emit the table rowkey as before. However, a new class -
> MultiHFileOutputFormat - will provide a helper function for mappers that
> generates a rowkey by prefixing the desired tablename to the existing rowkey,
> and will also provide configureIncrementalLoad support for multiple tables.
> * HFileOutputFormat2 will be updated in the following way:
> ** configureIncrementalLoad will now accept multiple table descriptor and
> region locator pairs, analogous to the single pair currently accepted by
> HFileOutputFormat2.
> ** Compression, Block Size, Bloom Type and Datablock encoding settings PER
> column family that are set in the Configuration object are now indexed and
> retrieved by tablename AND column family.
> ** getRegionStartKeys will now support multiple regionlocators and calculate
> split points, and therefore partitions, collectively for all tables.
> Similarly, the eventual number of Reducers will now be equal to the total
> number of partitions across all tables.
> ** The RecordWriter class will be able to process rowkeys either with or
> without the tablename prepended, depending on whether
> configureIncrementalLoad was called through MultiHFileOutputFormat or
> HFileOutputFormat2.
> * MultiHFileOutputFormat will write its output into HFiles that match the
> output format of HFileOutputFormat2. However, while the default use case
> will keep the existing directory structure, with the column family name as
> the directory and HFiles within that directory, MultiHFileOutputFormat will
> write HFiles in the output directory with the following relative paths:
> {noformat}
> --table1
>    --family1
>      --HFiles
> --table2
>    --family1
>    --family2
>      --HFiles
> {noformat}
> This aims to be a comprehensive solution to the original tickets - HBASE-3727
> and HBASE-16261. Thanks to [~clayb] for his support. This is a contribution
> from Bloomberg developers.
> The patch will be attached shortly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
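
The table-prefixed rowkey described in the proposal can be illustrated with a minimal, HBase-free sketch. It is not taken from the patch: the class name, the method names and the ';' separator byte are all hypothetical, and the sketch assumes table names never contain the separator. It shows how a mapper-side helper might prepend the table name, how a writer might split it off again, and how the table/family relative path from the layout above could be derived.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompositeKeySketch {
    // Hypothetical separator; assumes table names never contain this byte.
    static final byte SEPARATOR = (byte) ';';

    // Mapper side: prefix the target table name to the existing rowkey.
    static byte[] createCompositeKey(String tableName, byte[] rowKey) {
        byte[] table = tableName.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[table.length + 1 + rowKey.length];
        System.arraycopy(table, 0, out, 0, table.length);
        out[table.length] = SEPARATOR;
        System.arraycopy(rowKey, 0, out, table.length + 1, rowKey.length);
        return out;
    }

    // Position of the first separator, which marks the end of the table name.
    static int separatorIndex(byte[] compositeKey) {
        for (int i = 0; i < compositeKey.length; i++) {
            if (compositeKey[i] == SEPARATOR) {
                return i;
            }
        }
        throw new IllegalArgumentException("rowkey has no table prefix");
    }

    // Writer side: recover the table name from a composite key.
    static String tableOf(byte[] compositeKey) {
        return new String(compositeKey, 0, separatorIndex(compositeKey),
                StandardCharsets.UTF_8);
    }

    // Writer side: recover the original rowkey (may itself contain the separator).
    static byte[] rowKeyOf(byte[] compositeKey) {
        return Arrays.copyOfRange(compositeKey,
                separatorIndex(compositeKey) + 1, compositeKey.length);
    }

    // Relative output directory in the multi-table layout: table/family.
    static String hfileDir(byte[] compositeKey, String family) {
        return tableOf(compositeKey) + "/" + family;
    }

    public static void main(String[] args) {
        byte[] key = createCompositeKey("table2",
                "row-42".getBytes(StandardCharsets.UTF_8));
        System.out.println(hfileDir(key, "family1"));
    }
}
```

Splitting at the first separator keeps the original rowkey intact even when it contains the separator byte itself, which is why the assumption is placed on table names rather than on rowkeys.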