[ 
https://issues.apache.org/jira/browse/HBASE-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386231#comment-14386231
 ] 

Andrew Mains commented on HBASE-13356:
--------------------------------------

Spent some time speccing out a potential implementation for this today:

Interface: 

Jobs that need to run multiple scans over snapshots can use 
MultiTableSnapshotInputFormat. It would be configured through TableMapReduceUtil, 
as usual, via a new method with the signature:
{code}
  /**
   * Sets up the job for reading from one or more table snapshots, with one or
   * more scans per snapshot.
   * It bypasses HBase servers and reads directly from snapshot files.
   *
   * @param snapshotScans  map of snapshot name to the scans to run on that snapshot.
   * @param mapper  The mapper class to use.
   * @param outputKeyClass  The class of the output key.
   * @param outputValueClass  The class of the output value.
   * @param job  The current job to adjust.  Make sure the passed job is
   *           carrying all necessary HBase configuration.
   * @param addDependencyJars  upload HBase jars and jars for any of the configured
   *           job classes via the distributed cache (tmpjars).
   * @param tmpRestoreDir  directory the snapshots are restored into before being read.
   */
  public static void initMultiTableSnapshotMapperJob(
      Map<String, Collection<Scan>> snapshotScans,
      Class<? extends TableMapper> mapper,
      Class<?> outputKeyClass,
      Class<?> outputValueClass,
      Job job,
      boolean addDependencyJars,
      Path tmpRestoreDir) throws IOException;
{code}
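
To make the shape of the snapshotScans argument concrete, here is a rough sketch of how a caller might group scans by snapshot name before handing them to the proposed method. It uses only the JDK: ScanStub is a hypothetical stand-in for an HBase Scan, and the snapshot names and row ranges are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

final class SnapshotScansSketch {
  /** Hypothetical stand-in for an HBase Scan: just a start row and a stop row. */
  static final class ScanStub {
    final String startRow;
    final String stopRow;

    ScanStub(String startRow, String stopRow) {
      this.startRow = startRow;
      this.stopRow = stopRow;
    }
  }

  /** Adds one scan under the given snapshot name, creating the collection on first use. */
  static void addScan(Map<String, Collection<ScanStub>> snapshotScans,
                      String snapshotName, ScanStub scan) {
    snapshotScans.computeIfAbsent(snapshotName, k -> new ArrayList<>()).add(scan);
  }

  /** Builds an example map: two scans over one snapshot and one scan over another. */
  static Map<String, Collection<ScanStub>> example() {
    Map<String, Collection<ScanStub>> snapshotScans = new LinkedHashMap<>();
    addScan(snapshotScans, "snapshot_a", new ScanStub("a", "m"));
    addScan(snapshotScans, "snapshot_a", new ScanStub("m", "z"));
    addScan(snapshotScans, "snapshot_b", new ScanStub("", ""));
    return snapshotScans;
  }

  /** Counts the scans across all snapshots. */
  static int totalScans(Map<String, Collection<ScanStub>> snapshotScans) {
    int total = 0;
    for (Collection<ScanStub> scans : snapshotScans.values()) {
      total += scans.size();
    }
    return total;
  }
}
```

In a real job the values would be Scan objects, and the map would be passed as the first argument of the proposed initMultiTableSnapshotMapperJob.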

Implementation:

Most of the work can be done through delegation to 
TableSnapshotInputFormatImpl. The primary change would be to make 
TableSnapshotInputFormatImpl.InputSplit take in a scan object and restoreDir 
path, instead of retrieving these from the job configuration. This would allow 
MultiTableSnapshotInputFormat to avoid setting an individual scan and restore 
directory on the configuration (they can be passed along by way of the split, 
similar to TableSplit).
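
As a rough illustration of passing the scan and restore directory along with the split, the sketch below serializes both as part of the split itself. It is JDK-only and not the actual HBase code: SnapshotSplitSketch, its fields, and the string-encoded scan are assumptions, and write/readFields only mirror the shape of Hadoop's Writable methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical split that carries its own scan and restore directory. */
final class SnapshotSplitSketch {
  String snapshotName = "";
  String encodedScan = "";   // assumption: the scan is already serialized to a string
  String restoreDir = "";

  SnapshotSplitSketch() {}

  SnapshotSplitSketch(String snapshotName, String encodedScan, String restoreDir) {
    this.snapshotName = snapshotName;
    this.encodedScan = encodedScan;
    this.restoreDir = restoreDir;
  }

  /** Writes every field, so nothing has to be looked up in the job configuration. */
  void write(DataOutput out) throws IOException {
    out.writeUTF(snapshotName);
    out.writeUTF(encodedScan);
    out.writeUTF(restoreDir);
  }

  /** Reads the fields back in the same order they were written. */
  void readFields(DataInput in) throws IOException {
    snapshotName = in.readUTF();
    encodedScan = in.readUTF();
    restoreDir = in.readUTF();
  }

  /** Serializes a split to bytes and reads it back, as would happen between client and task. */
  static SnapshotSplitSketch roundTrip(SnapshotSplitSketch split) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        split.write(out);
      }
      SnapshotSplitSketch copy = new SnapshotSplitSketch();
      try (DataInputStream in =
               new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        copy.readFields(in);
      }
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because each split is self-describing, the record reader on the task side can open the right snapshot with the right scan without any per-scan keys in the configuration.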

Tests:

Any implementation should probably pass at least the tests for 
MultiTableInputFormat, and possibly some of the tests for 
TableSnapshotInputFormat as well.

Thoughts?

> HBase should provide an InputFormat supporting multiple scans in mapreduce 
> jobs over snapshots
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13356
>                 URL: https://issues.apache.org/jira/browse/HBASE-13356
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce
>            Reporter: Andrew Mains
>            Priority: Minor
>
> Currently, HBase supports the pushing of multiple scans to mapreduce jobs 
> over live tables (via MultiTableInputFormat) but only supports a single scan 
> for mapreduce jobs over table snapshots. It would be handy to support 
> multiple scans over snapshots as well, probably through another input format 
> (MultiTableSnapshotInputFormat?). To mimic the functionality present in 
> MultiTableInputFormat, the new input format would likely have to take in the 
> names of all snapshots used in addition to the scans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
