[ 
https://issues.apache.org/jira/browse/HBASE-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Connell resolved HBASE-29432.
-------------------------------------
    Resolution: Fixed

> ExportSnapshot should support rack-awareness
> --------------------------------------------
>
>                 Key: HBASE-29432
>                 URL: https://issues.apache.org/jira/browse/HBASE-29432
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Charles Connell
>            Assignee: Charles Connell
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3
>
>
> At my company we are using ExportSnapshot to copy HBase table snapshots to 
> S3, as a backup strategy. ExportSnapshot launches a MapReduce job to perform 
> the copy. This means that data flows from the HBase cluster's DataNodes, to a 
> YARN cluster's nodes, and then to S3.
> We are running HBase and YARN in AWS. AWS charges a fee for 
> cross-availability-zone network traffic, but not for same-availability-zone 
> traffic. If we could make the DataNode -> YARN node traffic not cross 
> availability zones, backups would be considerably cheaper. 
> I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and 
> a FileLocationResolver. Here's what they will look like:
> {code}
>   /**
>    * If desired, you may implement a CustomFileGrouper in order to influence how ExportSnapshot
>    * chooses which input files go into the MapReduce job's {@link InputSplit}s. Your implementation
>    * must return a data structure that contains each input file exactly once. Files that appear in
>    * separate entries in the top-level returned Collection are guaranteed to not be placed in the
>    * same InputSplit.
>    * This can be used to segregate your input files by the rack or host on which they are available,
>    * which, used in conjunction with {@link FileLocationResolver}, can improve the performance
>    * of your ExportSnapshot runs.
>    * To use this, pass the --custom-file-grouper argument with the fully qualified class name of
>    * an implementation of CustomFileGrouper that's on the classpath.
>    * If this argument is not used, no particular grouping logic will be applied.
>    */
>   public interface CustomFileGrouper {
>     Collection<Collection<Pair<SnapshotFileInfo, Long>>>
>       getGroupedInputFiles(final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
>   }
>   /**
>    * If desired, you may implement a FileLocationResolver in order to influence the _location_
>    * metadata attached to each {@link InputSplit} that ExportSnapshot will submit to YARN. The
>    * {@link #getLocationsForInputFiles(Collection)} method is called once for each InputSplit
>    * being constructed. Whatever is returned will ultimately be reported by that split's
>    * {@link InputSplit#getLocations()} method. This can be used to encourage YARN to schedule
>    * the ExportSnapshot's mappers on rack-local or host-local NodeManagers.
>    * To use this, pass the --file-location-resolver argument with the fully qualified class name of
>    * an implementation of FileLocationResolver that's on the classpath.
>    * If this argument is not used, no locations will be attached to the InputSplits.
>    */
>   public interface FileLocationResolver {
>     Set<String> getLocationsForInputFiles(final Collection<Pair<SnapshotFileInfo, Long>> files);
>   }
> {code}
> Users can optionally provide implementations of these interfaces on their 
> classpath, and tell ExportSnapshot to use them via new options. By default, 
> there will be no change in behavior. If users choose to implement these 
> plugins, they can influence ExportSnapshot to be topology-aware in a very 
> flexible way. I plan to write my own plugins optimized for AWS pricing, but 
> that won't be the only way this can be used.
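
A minimal sketch of how the two plugins described above might cooperate, assuming each file's availability zone is already known. FileEntry, ZoneFileGrouper, and ZoneLocationResolver are hypothetical stand-ins that mirror the shape of the proposed interfaces without depending on HBase classes. A real plugin would implement CustomFileGrouper and FileLocationResolver against Pair<SnapshotFileInfo, Long> and would have to work out for itself where each file's data lives.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for Pair<SnapshotFileInfo, Long>. The zone field is an
// assumption: a real plugin would need to look this up rather than be handed it.
final class FileEntry {
  final String path;
  final String zone;
  final long length;

  FileEntry(String path, String zone, long length) {
    this.path = path;
    this.zone = zone;
    this.length = length;
  }
}

// Mirrors the shape of the proposed CustomFileGrouper: one group per zone,
// with every input file appearing in exactly one group.
final class ZoneFileGrouper {
  Collection<Collection<FileEntry>> getGroupedInputFiles(Collection<FileEntry> files) {
    Map<String, Collection<FileEntry>> byZone = new LinkedHashMap<>();
    for (FileEntry f : files) {
      byZone.computeIfAbsent(f.zone, z -> new ArrayList<>()).add(f);
    }
    return new ArrayList<>(byZone.values());
  }
}

// Mirrors the shape of the proposed FileLocationResolver: reports the zones
// of the files that ended up in a single split.
final class ZoneLocationResolver {
  Set<String> getLocationsForInputFiles(Collection<FileEntry> files) {
    Set<String> zones = new TreeSet<>();
    for (FileEntry f : files) {
      zones.add(f.zone);
    }
    return zones;
  }
}

final class ZoneAwareSketch {
  public static void main(String[] args) {
    List<FileEntry> files = List.of(
        new FileEntry("hfile-1", "zone-a", 100L),
        new FileEntry("hfile-2", "zone-b", 200L),
        new FileEntry("hfile-3", "zone-a", 300L));
    ZoneLocationResolver resolver = new ZoneLocationResolver();
    for (Collection<FileEntry> group : new ZoneFileGrouper().getGroupedInputFiles(files)) {
      // Each group would become one or more splits carrying a single zone hint.
      System.out.println(group.size() + " file(s) -> " + resolver.getLocationsForInputFiles(group));
    }
  }
}
```

Because the grouper never mixes zones within a group, the resolver returns exactly one location per group here. Whether a zone name is a useful scheduling hint depends on how the YARN cluster's topology is configured, which this sketch does not model.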



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
