Charles Connell created HBASE-29432:
---------------------------------------

             Summary: ExportSnapshot should support rack-awareness
                 Key: HBASE-29432
                 URL: https://issues.apache.org/jira/browse/HBASE-29432
             Project: HBase
          Issue Type: Improvement
            Reporter: Charles Connell
            Assignee: Charles Connell


At my company we are using ExportSnapshot to copy HBase table snapshots to S3, 
as a backup strategy. ExportSnapshot launches a MapReduce job to perform the 
copy. This means that data flows from the HBase cluster's DataNodes, to a YARN 
cluster's nodes, and then to S3.

We are running HBase in AWS. AWS charges a fee for cross-availability-zone 
network traffic, but not for same-availability-zone traffic. If we could keep 
the DataNode -> YARN node traffic within a single availability zone, backups 
would be considerably cheaper.

I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and a 
FileLocationResolver. Here's what they will look like:

{code}

  /**
   * If desired, you may implement a CustomFileGrouper in order to influence how ExportSnapshot
   * chooses which input files go into the MapReduce job's {@link InputSplit}s. Your implementation
   * must return a data structure that contains each input file exactly once. Files that appear in
   * separate entries in the top-level returned Collection are guaranteed to not be placed in the
   * same InputSplit.
   * This can be used to segregate your input files by the rack or host on which they are available,
   * which, used in conjunction with {@link FileLocationResolver}, can improve the performance
   * of your ExportSnapshot runs.
   * To use this, pass the --custom-file-grouper argument with the fully qualified class name of
   * an implementation of CustomFileGrouper that's on the classpath.
   * If this argument is not used, no particular grouping logic will be applied.
   */
  public interface CustomFileGrouper {
    Collection<Collection<Pair<SnapshotFileInfo, Long>>>
      getGroupedInputFiles(final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
  }


  /**
   * If desired, you may implement a FileLocationResolver in order to influence the _location_
   * metadata attached to each {@link InputSplit} that ExportSnapshot will submit to YARN. The
   * {@link #getLocationsForInputFiles(Collection)} method is called once for each InputSplit
   * being constructed. Whatever is returned will ultimately be reported by that split's
   * {@link InputSplit#getLocations()} method. This can be used to encourage YARN to schedule
   * the ExportSnapshot's mappers on rack-local or host-local NodeManagers.
   * To use this, pass the --file-location-resolver argument with the fully qualified class name of
   * an implementation of FileLocationResolver that's on the classpath.
   * If this argument is not used, no locations will be attached to the InputSplits.
   */
  public interface FileLocationResolver {
    Set<String> getLocationsForInputFiles(final Collection<Pair<SnapshotFileInfo, Long>> files);
  }
{code}
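
To make the intended contract concrete, here is a rough sketch of one class that could fill both roles. It is not the proposed implementation. To keep it self-contained, FileEntry is a hypothetical stand-in for Pair<SnapshotFileInfo, Long>, the two interfaces are re-declared against that stand-in, and locationOf is an assumed lookup from a file to the rack or host where it is available. Which location strings YARN actually honors is left open here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical stand-in for Pair<SnapshotFileInfo, Long>: a file path plus its size.
final class FileEntry {
  final String path;
  final long size;

  FileEntry(String path, long size) {
    this.path = path;
    this.size = size;
  }
}

// Stand-ins that mirror the proposed interfaces, using FileEntry instead of the HBase types.
interface CustomFileGrouper {
  Collection<Collection<FileEntry>> getGroupedInputFiles(Collection<FileEntry> snapshotFiles);
}

interface FileLocationResolver {
  Set<String> getLocationsForInputFiles(Collection<FileEntry> files);
}

final class LocationAwarePlugins implements CustomFileGrouper, FileLocationResolver {
  // Assumed lookup from a file to the rack or host where it is available.
  private final Function<FileEntry, String> locationOf;

  LocationAwarePlugins(Function<FileEntry, String> locationOf) {
    this.locationOf = locationOf;
  }

  // Puts each file into exactly one group, keyed by its location, so files from
  // different locations can never share an InputSplit.
  @Override
  public Collection<Collection<FileEntry>> getGroupedInputFiles(
      Collection<FileEntry> snapshotFiles) {
    Map<String, Collection<FileEntry>> byLocation = new LinkedHashMap<>();
    for (FileEntry file : snapshotFiles) {
      byLocation.computeIfAbsent(locationOf.apply(file), k -> new ArrayList<>()).add(file);
    }
    return new ArrayList<>(byLocation.values());
  }

  // Reports the distinct locations of the files in one split.
  @Override
  public Set<String> getLocationsForInputFiles(Collection<FileEntry> files) {
    Set<String> locations = new TreeSet<>();
    for (FileEntry file : files) {
      locations.add(locationOf.apply(file));
    }
    return locations;
  }
}
```

When the same lookup backs both plugins, every split built from a single group resolves to exactly one location. That is the case where the scheduling hint is most useful.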

Users can optionally provide implementations of these interfaces on their 
classpath, and tell ExportSnapshot to use them via new options. By default, 
there will be no change in behavior. If users choose to implement these 
plugins, they can influence ExportSnapshot to be topology-aware in a very 
flexible way. I plan to write my own plugins optimized for AWS pricing, but 
that won't be the only way this can be used.
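
For illustration only, one way ExportSnapshot could turn a class name passed on the command line into a plugin instance is plain reflection. This is an assumption on my part, not the patch: PluginLoader and load are hypothetical names, and the sketch assumes plugin classes have a no-argument constructor.

```java
// Hypothetical helper: instantiates a plugin from its fully qualified class name.
final class PluginLoader {
  private PluginLoader() {
  }

  // Returns null when the option was not passed, so the caller keeps today's behavior.
  static <T> T load(String className, Class<T> pluginType)
      throws ReflectiveOperationException {
    if (className == null || className.isEmpty()) {
      return null;
    }
    // asSubclass throws ClassCastException if the class does not implement pluginType.
    return Class.forName(className)
        .asSubclass(pluginType)
        .getDeclaredConstructor()
        .newInstance();
  }
}
```

A null result here corresponds to the default case described above, where no grouping logic is applied and no locations are attached.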



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
