[jira] [Updated] (HBASE-18161) Incremental Load support for Multiple-Table HFileOutputFormat

Densel Santhmayor (JIRA) Fri, 23 Jun 2017 10:03:06 -0700

     [ https://issues.apache.org/jira/browse/HBASE-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Densel Santhmayor updated HBASE-18161:
--------------------------------------
    Description: 
h2. Introduction

MapReduce currently supports writing HBase records in bulk to HFiles for a 
single table. The file(s) can then be uploaded to the relevant RegionServers 
with reasonable latency. This feature makes a large set of data available for 
queries at the same time and provides a way to load very large input into 
HBase efficiently without affecting query latencies.

There is, however, no support to write variations of the same record key to 
HFiles belonging to multiple HBase tables from within the same MapReduce job.  

h2. Goal

The goal of this JIRA is to extend HFileOutputFormat2 to support writing to 
HFiles for different tables within the same MapReduce job while keeping 
single-table HFile features backwards-compatible.

For our use case, we needed to write a record key to a smaller HBase table for 
quicker access, and the same record key with a date appended to a larger table 
for longer term storage with chronological access. Each of these tables would 
have different TTL and other settings to support their respective access 
patterns. We also needed to be able to bulk write records to multiple tables 
with different subsets of very large input as efficiently as possible. Rather 
than run the MapReduce job multiple times (one for each table or record 
structure), it would be useful to be able to parse the input a single time and 
write to multiple tables simultaneously.

Additionally, we'd like to maintain backwards compatibility with the existing, 
heavily used HFileOutputFormat2 interface so that benefits such as locality 
sensitivity (which was introduced long after we implemented support for 
multiple tables) apply to both single-table and multi-table HFile writes.

h2. Proposal
* Backwards compatibility for existing single table support in 
HFileOutputFormat2 will be maintained; in this case, mappers will emit the 
table rowkey as before. However, a new class, MultiHFileOutputFormat, will 
provide a helper function that generates a rowkey for mappers by prefixing the 
desired tablename to the existing rowkey. It will also provide 
configureIncrementalLoad support for multiple tables.
* HFileOutputFormat2 will be updated in the following way:
** configureIncrementalLoad will now accept multiple table descriptor and 
region locator pairs, analogous to the single pair currently accepted by 
HFileOutputFormat2. 
** Compression, Block Size, Bloom Type and Data Block Encoding settings PER 
column family that are set in the Configuration object are now indexed and 
retrieved by tablename AND column family.
** getRegionStartKeys will now support multiple regionlocators and calculate 
split points, and therefore partitions, collectively for all tables. As a 
result, the eventual number of Reducers will equal the total number of 
partitions across all tables.
** The RecordWriter class will be able to process rowkeys either with or 
without the tablename prepended, depending on whether 
configureIncrementalLoad was called through MultiHFileOutputFormat or 
HFileOutputFormat2.
* MultiHFileOutputFormat will write HFiles in the same format as 
HFileOutputFormat2. However, while the default use case will keep the existing 
directory structure (one directory per column family, with the HFiles inside 
it), MultiHFileOutputFormat will add a directory level per table and write 
HFiles in the output directory with the following relative paths:
{noformat}
     --table1 
       --family1 
         --HFiles 
     --table2 
       --family1 
       --family2 
         --HFiles
{noformat}
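
To make the table-prefixed rowkey and the resulting directory layout concrete, here is a minimal sketch in plain Java. It is not the code from the patch: the class name TableRowKeys, its methods, and the choice of ';' as the separator byte are all assumptions made for illustration. It only shows how a mapper-side helper could build a composite key and how a writer could recover the table name, the original rowkey, and the table/family output directory from it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not the class from the patch. */
final class TableRowKeys {
    // Assumption: ';' separates table name and rowkey and never occurs in a table name.
    static final byte SEPARATOR = (byte) ';';

    /** Builds the composite key a mapper would emit: tableName + SEPARATOR + rowKey. */
    static byte[] prefix(String tableName, byte[] rowKey) {
        byte[] table = tableName.getBytes(StandardCharsets.UTF_8);
        byte[] out = Arrays.copyOf(table, table.length + 1 + rowKey.length);
        out[table.length] = SEPARATOR;
        System.arraycopy(rowKey, 0, out, table.length + 1, rowKey.length);
        return out;
    }

    /** Position of the first separator; the rowkey itself may contain further ones. */
    private static int separatorIndex(byte[] compositeKey) {
        for (int i = 0; i < compositeKey.length; i++) {
            if (compositeKey[i] == SEPARATOR) {
                return i;
            }
        }
        throw new IllegalArgumentException("no table prefix in key");
    }

    /** Table name encoded at the front of a composite key. */
    static String tableOf(byte[] compositeKey) {
        return new String(compositeKey, 0, separatorIndex(compositeKey), StandardCharsets.UTF_8);
    }

    /** Original rowkey with the table prefix and separator stripped. */
    static byte[] rowOf(byte[] compositeKey) {
        return Arrays.copyOfRange(compositeKey, separatorIndex(compositeKey) + 1, compositeKey.length);
    }

    /** Relative output directory for a cell: table, then column family. */
    static String relativeDir(byte[] compositeKey, String family) {
        return tableOf(compositeKey) + "/" + family;
    }

    public static void main(String[] args) {
        byte[] key = prefix("table1", "row-42".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(key, StandardCharsets.UTF_8)); // table1;row-42
        System.out.println(relativeDir(key, "family1"));             // table1/family1
    }
}
```

Splitting on the first separator only means the original rowkey may itself contain the separator byte, as long as table names do not.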

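The collective split-point calculation described for getRegionStartKeys can be sketched the same way. This is an illustration under stated assumptions, not the patch's implementation: the class CombinedSplitPoints, the ';' separator, and the use of plain strings for table names are hypothetical. Each table's region start keys are prefixed with the table name and merged into one sorted list, so the number of entries (and hence reducers) equals the total number of regions across all tables.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch of merging region start keys from several tables. */
final class CombinedSplitPoints {
    // Assumption: same separator byte the mappers use when prefixing rowkeys.
    static final byte SEPARATOR = (byte) ';';

    /** Lexicographic comparison treating bytes as unsigned values. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Prefixes every start key with its table name and returns them in sorted order. */
    static List<byte[]> combine(Map<String, List<byte[]>> startKeysByTable) {
        TreeSet<byte[]> sorted = new TreeSet<>(CombinedSplitPoints::compareUnsigned);
        for (Map.Entry<String, List<byte[]>> entry : startKeysByTable.entrySet()) {
            byte[] table = entry.getKey().getBytes(StandardCharsets.UTF_8);
            for (byte[] startKey : entry.getValue()) {
                byte[] key = new byte[table.length + 1 + startKey.length];
                System.arraycopy(table, 0, key, 0, table.length);
                key[table.length] = SEPARATOR;
                System.arraycopy(startKey, 0, key, table.length + 1, startKey.length);
                sorted.add(key);
            }
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        Map<String, List<byte[]>> regions = Map.of(
            "table1", List.of(new byte[0], "m".getBytes(StandardCharsets.UTF_8)),
            "table2", List.of(new byte[0]));
        // One partition per region across both tables.
        System.out.println(combine(regions).size()); // 3
    }
}
```

Because the mappers' composite keys and the split points share the same table prefix, a total-order partitioner built from this list sends each table's rows to that table's own range of reducers.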
This aims to be a comprehensive solution to the original tickets - HBASE-3727 
and HBASE-16261. Thanks to [~clayb] for his support. This is a contribution 
from Bloomberg developers.

The patch will be attached shortly.


> Incremental Load support for Multiple-Table HFileOutputFormat
> -------------------------------------------------------------
>
>                 Key: HBASE-18161
>                 URL: https://issues.apache.org/jira/browse/HBASE-18161
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Densel Santhmayor
>            Priority: Minor
>         Attachments: MultiHFileOutputFormatSupport_HBASE_18161.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v2.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v3.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v4.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v5.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v6.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v7.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v8.patch, 
> MultiHFileOutputFormatSupport_HBASE_18161_v9.patch



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
