Charles Connell created HBASE-29432:
---------------------------------------
Summary: ExportSnapshot should support rack-awareness
Key: HBASE-29432
URL: https://issues.apache.org/jira/browse/HBASE-29432
Project: HBase
Issue Type: Improvement
Reporter: Charles Connell
Assignee: Charles Connell
At my company we are using ExportSnapshot to copy HBase table snapshots to S3,
as a backup strategy. ExportSnapshot launches a MapReduce job to perform the
copy. This means that data flows from the HBase cluster's DataNodes, to a YARN
cluster's nodes, and then to S3.
We are running HBase in AWS. AWS charges a fee for cross-availability-zone
network traffic, but not for same-availability-zone traffic. If we could keep
the DataNode -> YARN node traffic within a single availability zone, backups
would be considerably cheaper.
I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and a
FileLocationResolver. Here's what they will look like:
{code}
/**
 * If desired, you may implement a CustomFileGrouper in order to influence how ExportSnapshot
 * chooses which input files go into the MapReduce job's {@link InputSplit}s. Your implementation
 * must return a data structure that contains each input file exactly once. Files that appear in
 * separate entries in the top-level returned Collection are guaranteed to not be placed in the
 * same InputSplit.
 * This can be used to segregate your input files by the rack or host on which they are available,
 * which, used in conjunction with {@link FileLocationResolver}, can improve the performance
 * of your ExportSnapshot runs.
 * To use this, pass the --custom-file-grouper argument with the fully qualified class name of
 * an implementation of CustomFileGrouper that's on the classpath.
 * If this argument is not used, no particular grouping logic will be applied.
 */
public interface CustomFileGrouper {
  Collection<Collection<Pair<SnapshotFileInfo, Long>>> getGroupedInputFiles(
    final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
}

/**
 * If desired, you may implement a FileLocationResolver in order to influence the _location_
 * metadata attached to each {@link InputSplit} that ExportSnapshot will submit to YARN. The
 * method {@link #getLocationsForInputFiles(Collection)} is called once for each InputSplit
 * being constructed. Whatever is returned will ultimately be reported by that split's
 * {@link InputSplit#getLocations()} method. This can be used to encourage YARN to schedule
 * the ExportSnapshot's mappers on rack-local or host-local NodeManagers.
 * To use this, pass the --file-location-resolver argument with the fully qualified class name of
 * an implementation of FileLocationResolver that's on the classpath.
 * If this argument is not used, no locations will be attached to the InputSplits.
 */
public interface FileLocationResolver {
  Set<String> getLocationsForInputFiles(
    final Collection<Pair<SnapshotFileInfo, Long>> files);
}
{code}
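To illustrate how the two contracts fit together, here is a minimal, self-contained sketch of a zone-aware grouper and resolver. It is an illustration under stated assumptions, not the proposed implementation: FileEntry is a hypothetical stand-in for Pair<SnapshotFileInfo, Long>, the zone field replaces whatever topology lookup a real plugin would perform, and the returned location strings are placeholders (what a real resolver should return depends on how the YARN cluster's topology is configured).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: none of these types exist in HBase. FileEntry stands in for
// Pair<SnapshotFileInfo, Long>; the zone is assumed to be known up front.
class ZoneAwarePluginSketch {

  /** Hypothetical stand-in for one snapshot file, its length, and its zone. */
  static final class FileEntry {
    final String path;
    final long length;
    final String zone;

    FileEntry(String path, long length, String zone) {
      this.path = path;
      this.length = length;
      this.zone = zone;
    }
  }

  /** Mirrors the CustomFileGrouper contract: each file lands in exactly one group, one group per zone. */
  static Collection<Collection<FileEntry>> groupByZone(Collection<FileEntry> files) {
    Map<String, Collection<FileEntry>> byZone = new LinkedHashMap<>();
    for (FileEntry f : files) {
      byZone.computeIfAbsent(f.zone, z -> new ArrayList<>()).add(f);
    }
    return new ArrayList<>(byZone.values());
  }

  /** Mirrors the FileLocationResolver contract: the distinct zones of the files in one split. */
  static Set<String> locationsFor(Collection<FileEntry> splitFiles) {
    Set<String> zones = new TreeSet<>();
    for (FileEntry f : splitFiles) {
      zones.add(f.zone);
    }
    return zones;
  }

  public static void main(String[] args) {
    List<FileEntry> files = Arrays.asList(
        new FileEntry("hfile-1", 10L, "zone-a"),
        new FileEntry("hfile-2", 20L, "zone-b"),
        new FileEntry("hfile-3", 30L, "zone-a"));
    for (Collection<FileEntry> group : groupByZone(files)) {
      System.out.println(group.size() + " file(s) -> " + locationsFor(group));
    }
  }
}
```

Because every group holds files from a single zone, and files in separate groups never share an InputSplit, each split built from one group resolves to exactly one location in this sketch.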
Users can optionally provide implementations of these interfaces on their
classpath, and tell ExportSnapshot to use them via new options. By default,
there will be no change in behavior. If users choose to implement these
plugins, they can influence ExportSnapshot to be topology-aware in a very
flexible way. I plan to write my own plugins optimized for AWS pricing, but
that won't be the only way this can be used.
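For illustration, one way a tool could turn a fully qualified class name from an option into a plugin instance is plain Java reflection. The sketch below is an assumption about the mechanism, not the proposed ExportSnapshot code: Grouper is a simplified stand-in for CustomFileGrouper, loadPlugin is a hypothetical helper, and treating "no plugin" as a single group is just one way to model "no change in behavior".

```java
import java.util.Collection;
import java.util.Collections;

// Sketch only: Grouper and loadPlugin are hypothetical and not part of HBase.
class PluginLoadingSketch {

  /** Simplified stand-in for the proposed CustomFileGrouper interface. */
  interface Grouper {
    Collection<Collection<String>> group(Collection<String> files);
  }

  /** Assumed default when no plugin is configured: all files in one group. */
  static final class NoOpGrouper implements Grouper {
    @Override
    public Collection<Collection<String>> group(Collection<String> files) {
      return Collections.singletonList(files);
    }
  }

  /** Instantiates className through its no-arg constructor, or falls back to the default. */
  static Grouper loadPlugin(String className) {
    if (className == null || className.isEmpty()) {
      return new NoOpGrouper();
    }
    try {
      Class<? extends Grouper> clazz = Class.forName(className).asSubclass(Grouper.class);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load plugin class: " + className, e);
    }
  }

  public static void main(String[] args) {
    // With no argument, the default is used; otherwise args[0] names the plugin class.
    Grouper grouper = loadPlugin(args.length > 0 ? args[0] : null);
    System.out.println("Using " + grouper.getClass().getName());
  }
}
```

Failing fast with a clear error when the class is missing or does not implement the interface keeps a misconfigured option from silently falling back to the default behavior.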