Charles Connell created HBASE-29432:
---------------------------------------

             Summary: ExportSnapshot should support rack-awareness
                 Key: HBASE-29432
                 URL: https://issues.apache.org/jira/browse/HBASE-29432
             Project: HBase
          Issue Type: Improvement
            Reporter: Charles Connell
            Assignee: Charles Connell


At my company we use ExportSnapshot to copy HBase table snapshots to S3 as a backup strategy. ExportSnapshot launches a MapReduce job to perform the copy. This means that data flows from the HBase cluster's DataNodes to a YARN cluster's nodes, and then to S3. We are running HBase in AWS. AWS charges a fee for cross-availability-zone network traffic, but not for same-availability-zone traffic. If the DataNode -> YARN node traffic did not need to cross availability zones, backups would be considerably cheaper.

I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and a FileLocationResolver. Here's what they will look like:

{code}
/**
 * If desired, you may implement a CustomFileGrouper in order to influence how ExportSnapshot
 * chooses which input files go into the MapReduce job's {@link InputSplit}s. Your implementation
 * must return a data structure that contains each input file exactly once. Files that appear in
 * separate entries in the top-level returned Collection are guaranteed to not be placed in the
 * same InputSplit.
 * This can be used to segregate your input files by the rack or host on which they are available,
 * which, used in conjunction with {@link FileLocationResolver}, can improve the performance
 * of your ExportSnapshot runs.
 * To use this, pass the --custom-file-grouper argument with the fully qualified class name of
 * an implementation of CustomFileGrouper that's on the classpath.
 * If this argument is not used, no particular grouping logic will be applied.
 */
public interface CustomFileGrouper {
  Collection<Collection<Pair<SnapshotFileInfo, Long>>> getGroupedInputFiles(
    final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
}

/**
 * If desired, you may implement a FileLocationResolver in order to influence the _location_
 * metadata attached to each {@link InputSplit} that ExportSnapshot will submit to YARN. The
 * {@link #getLocationsForInputFiles(Collection)} method is called once for each InputSplit
 * being constructed. Whatever is returned will ultimately be reported by that split's
 * {@link InputSplit#getLocations()} method. This can be used to encourage YARN to schedule
 * the ExportSnapshot's mappers on rack-local or host-local NodeManagers.
 * To use this, pass the --file-location-resolver argument with the fully qualified class name of
 * an implementation of FileLocationResolver that's on the classpath.
 * If this argument is not used, no locations will be attached to the InputSplits.
 */
public interface FileLocationResolver {
  Set<String> getLocationsForInputFiles(final Collection<Pair<SnapshotFileInfo, Long>> files);
}
{code}

Users can optionally provide implementations of these interfaces on their classpath, and tell ExportSnapshot to use them via the new options. By default, there will be no change in behavior. Users who implement these plugins can make ExportSnapshot topology-aware in a flexible way. I plan to write my own plugins optimized for AWS pricing, but that won't be the only way this can be used.
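
To illustrate how the two plugins could cooperate, here is a minimal, self-contained sketch of an availability-zone-aware implementation. It is not the planned implementation. Everything in it beyond the two interface shapes is an assumption made for illustration: the class names ZoneAwareExportSketch and ZoneAwarePlugin, the FileEntry type (a stand-in for Pair<SnapshotFileInfo, Long> so the example compiles without HBase on the classpath), the two lookup maps, and the zone and host names. A real plugin would need its own way of discovering which zone holds each file and which YARN hosts sit in each zone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ZoneAwareExportSketch {

  /** Hypothetical stand-in for Pair<SnapshotFileInfo, Long>: a file name and its size. */
  static final class FileEntry {
    final String name;
    final long sizeBytes;

    FileEntry(String name, long sizeBytes) {
      this.name = name;
      this.sizeBytes = sizeBytes;
    }
  }

  /** Same shape as the proposed interface, with FileEntry replacing the HBase types. */
  interface CustomFileGrouper {
    Collection<Collection<FileEntry>> getGroupedInputFiles(Collection<FileEntry> snapshotFiles);
  }

  /** Same shape as the proposed interface, with FileEntry replacing the HBase types. */
  interface FileLocationResolver {
    Set<String> getLocationsForInputFiles(Collection<FileEntry> files);
  }

  /** Groups files by availability zone and reports the YARN hosts in those zones. */
  static final class ZoneAwarePlugin implements CustomFileGrouper, FileLocationResolver {
    private final Map<String, String> zoneByFile;        // assumed lookup: file name -> zone
    private final Map<String, List<String>> hostsByZone; // assumed lookup: zone -> YARN hosts

    ZoneAwarePlugin(Map<String, String> zoneByFile, Map<String, List<String>> hostsByZone) {
      this.zoneByFile = zoneByFile;
      this.hostsByZone = hostsByZone;
    }

    @Override
    public Collection<Collection<FileEntry>> getGroupedInputFiles(
        Collection<FileEntry> snapshotFiles) {
      // One group per zone; files with no known zone share an "unknown" group,
      // so every input file still appears exactly once in the result.
      Map<String, Collection<FileEntry>> groups = new LinkedHashMap<>();
      for (FileEntry f : snapshotFiles) {
        String zone = zoneByFile.getOrDefault(f.name, "unknown");
        groups.computeIfAbsent(zone, z -> new ArrayList<>()).add(f);
      }
      return new ArrayList<>(groups.values());
    }

    @Override
    public Set<String> getLocationsForInputFiles(Collection<FileEntry> files) {
      // Union of the hosts in every zone that holds one of the given files.
      Set<String> hosts = new TreeSet<>();
      for (FileEntry f : files) {
        String zone = zoneByFile.get(f.name);
        if (zone != null) {
          hosts.addAll(hostsByZone.getOrDefault(zone, Collections.<String>emptyList()));
        }
      }
      return hosts;
    }
  }

  public static void main(String[] args) {
    Map<String, String> zoneByFile = new HashMap<>();
    zoneByFile.put("hfile-1", "az-a");
    zoneByFile.put("hfile-2", "az-b");
    Map<String, List<String>> hostsByZone = new HashMap<>();
    hostsByZone.put("az-a", Arrays.asList("yarn-a1", "yarn-a2"));
    hostsByZone.put("az-b", Arrays.asList("yarn-b1"));

    ZoneAwarePlugin plugin = new ZoneAwarePlugin(zoneByFile, hostsByZone);
    List<FileEntry> files =
        Arrays.asList(new FileEntry("hfile-1", 10L), new FileEntry("hfile-2", 20L));
    for (Collection<FileEntry> group : plugin.getGroupedInputFiles(files)) {
      System.out.println(group.size() + " file(s) -> "
          + plugin.getLocationsForInputFiles(group));
    }
  }
}
```

Because the grouper keeps files from different zones in separate top-level entries, no InputSplit would mix zones, and the resolver can then attach only same-zone hosts to each split.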