[
https://issues.apache.org/jira/browse/HBASE-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Connell resolved HBASE-29432.
-------------------------------------
Resolution: Fixed
> ExportSnapshot should support rack-awareness
> --------------------------------------------
>
> Key: HBASE-29432
> URL: https://issues.apache.org/jira/browse/HBASE-29432
> Project: HBase
> Issue Type: Improvement
> Reporter: Charles Connell
> Assignee: Charles Connell
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3
>
>
> At my company we are using ExportSnapshot to copy HBase table snapshots to
> S3, as a backup strategy. ExportSnapshot launches a MapReduce job to perform
> the copy. This means that data flows from the HBase cluster's DataNodes, to a
> YARN cluster's nodes, and then to S3.
> We are running HBase and YARN in AWS. AWS charges a fee for
> cross-availability-zone network traffic, but not for same-availability-zone
> traffic. If we could make the DataNode -> YARN node traffic not cross
> availability zones, backups would be considerably cheaper.
> I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and
> a FileLocationResolver. Here's what they will look like:
> {code}
> /**
>  * If desired, you may implement a CustomFileGrouper in order to influence
>  * how ExportSnapshot chooses which input files go into the MapReduce job's
>  * {@link InputSplit}s. Your implementation must return a data structure
>  * that contains each input file exactly once. Files that appear in separate
>  * entries in the top-level returned Collection are guaranteed to not be
>  * placed in the same InputSplit.
>  * This can be used to segregate your input files by the rack or host on
>  * which they are available, which, used in conjunction with
>  * {@link FileLocationResolver}, can improve the performance of your
>  * ExportSnapshot runs.
>  * To use this, pass the --custom-file-grouper argument with the fully
>  * qualified class name of an implementation of CustomFileGrouper that's on
>  * the classpath.
>  * If this argument is not used, no particular grouping logic will be
>  * applied.
>  */
> public interface CustomFileGrouper {
>   Collection<Collection<Pair<SnapshotFileInfo, Long>>> getGroupedInputFiles(
>     final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
> }
> /**
>  * If desired, you may implement a FileLocationResolver in order to
>  * influence the _location_ metadata attached to each {@link InputSplit}
>  * that ExportSnapshot will submit to YARN. The
>  * {@link #getLocationsForInputFiles(Collection)} method is called once for
>  * each InputSplit being constructed. Whatever is returned will ultimately
>  * be reported by that split's {@link InputSplit#getLocations()} method.
>  * This can be used to encourage YARN to schedule the ExportSnapshot's
>  * mappers on rack-local or host-local NodeManagers.
>  * To use this, pass the --file-location-resolver argument with the fully
>  * qualified class name of an implementation of FileLocationResolver that's
>  * on the classpath.
>  * If this argument is not used, no locations will be attached to the
>  * InputSplits.
>  */
> public interface FileLocationResolver {
>   Set<String> getLocationsForInputFiles(
>     final Collection<Pair<SnapshotFileInfo, Long>> files);
> }
> {code}
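
To make the grouping contract concrete, here is a minimal, self-contained sketch of the partitioning logic a CustomFileGrouper implementation might use. It is not taken from the patch. The class name ZoneGroupingSketch, the method groupByZone, and the zoneOf lookup are all hypothetical, and the file type is left generic so the example runs without HBase on the classpath. In a real implementation the element type would be Pair<SnapshotFileInfo, Long>, and the zone lookup would have to come from your own topology information.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; not part of HBase. */
final class ZoneGroupingSketch {

  /**
   * Partitions files into one group per zone. Every input file lands in
   * exactly one group, which is what the CustomFileGrouper contract requires.
   * zoneOf is a hypothetical lookup from a file to its availability zone.
   */
  static <F> Collection<Collection<F>> groupByZone(
      Collection<F> files, Function<F, String> zoneOf) {
    // LinkedHashMap keeps groups in first-seen order, so output is deterministic.
    Map<String, Collection<F>> byZone = new LinkedHashMap<>();
    for (F file : files) {
      byZone.computeIfAbsent(zoneOf.apply(file), z -> new ArrayList<>()).add(file);
    }
    return new ArrayList<>(byZone.values());
  }

  public static void main(String[] args) {
    // Made-up file names and zones standing in for real snapshot files.
    Map<String, String> zoneByFile = Map.of(
        "hfile-a", "us-east-1a",
        "hfile-b", "us-east-1b",
        "hfile-c", "us-east-1a");
    Function<String, String> zoneOf = zoneByFile::get;
    List<String> files = List.of("hfile-a", "hfile-b", "hfile-c");
    // prints [[hfile-a, hfile-c], [hfile-b]]
    System.out.println(groupByZone(files, zoneOf));
  }
}
```

Because files in different top-level groups never share an InputSplit, grouping by zone like this means each mapper would read from a single zone only, provided the zone lookup is accurate.
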
> Users can optionally provide implementations of these interfaces on their
> classpath, and tell ExportSnapshot to use them via new options. By default,
> there will be no change in behavior. If users choose to implement these
> plugins, they can influence ExportSnapshot to be topology-aware in a very
> flexible way. I plan to write my own plugins optimized for AWS pricing, but
> that won't be the only way this can be used.
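
As one illustration of the flexibility described above, the following sketch shows logic a FileLocationResolver might use to pick locations for a split: rank hosts by how many bytes of the split's files they hold and return the top few. This is an assumption-laden example, not the patch's behavior. SplitLocationSketch, topHostsByBytes, hostsOf, sizeOf, and maxHosts are hypothetical names, the file type is generic so the code runs standalone, and it assumes the Long in each Pair is the file's size in bytes.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.ToLongFunction;

/** Illustrative sketch only; not part of HBase. */
final class SplitLocationSketch {

  /**
   * Returns up to maxHosts host names, ordered by the number of bytes of the
   * given files that each host stores (ties broken by host name).
   * hostsOf and sizeOf are hypothetical lookups supplied by the caller.
   */
  static <F> Set<String> topHostsByBytes(Collection<F> files,
      Function<F, Collection<String>> hostsOf, ToLongFunction<F> sizeOf, int maxHosts) {
    // Sum the bytes each host could serve locally for this split.
    Map<String, Long> bytesPerHost = new HashMap<>();
    for (F file : files) {
      for (String host : hostsOf.apply(file)) {
        bytesPerHost.merge(host, sizeOf.applyAsLong(file), Long::sum);
      }
    }
    // LinkedHashSet preserves the ranking order in the returned set.
    Set<String> result = new LinkedHashSet<>();
    bytesPerHost.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Long>comparingByKey()))
        .limit(maxHosts)
        .forEach(e -> result.add(e.getKey()));
    return result;
  }

  public static void main(String[] args) {
    // Made-up replica placement and sizes standing in for real block locations.
    Map<String, List<String>> hostsByFile = Map.of(
        "f1", List.of("h1", "h2"),
        "f2", List.of("h2", "h3"));
    Map<String, Long> sizeByFile = Map.of("f1", 100L, "f2", 50L);
    Function<String, Collection<String>> hostsOf = hostsByFile::get;
    ToLongFunction<String> sizeOf = sizeByFile::get;
    // h2 holds 150 bytes, h1 holds 100, h3 holds 50; prints [h2, h1]
    System.out.println(topHostsByBytes(List.of("f1", "f2"), hostsOf, sizeOf, 2));
  }
}
```

A resolver built around this idea would be selected with the --file-location-resolver argument and would typically be paired with a grouper passed via --custom-file-grouper, so that the files in each split actually share hosts or racks.
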