[ https://issues.apache.org/jira/browse/HBASE-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Charles Connell resolved HBASE-29432.
-------------------------------------
    Resolution: Fixed

> ExportSnapshot should support rack-awareness
> --------------------------------------------
>
>                 Key: HBASE-29432
>                 URL: https://issues.apache.org/jira/browse/HBASE-29432
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Charles Connell
>            Assignee: Charles Connell
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3
>
>
> At my company we are using ExportSnapshot to copy HBase table snapshots to
> S3 as a backup strategy. ExportSnapshot launches a MapReduce job to perform
> the copy. This means that data flows from the HBase cluster's DataNodes to a
> YARN cluster's nodes, and then to S3.
> We are running HBase and YARN in AWS. AWS charges a fee for
> cross-availability-zone network traffic, but not for same-availability-zone
> traffic. If we could keep the DataNode -> YARN node traffic from crossing
> availability zones, backups would be considerably cheaper.
> I propose to make ExportSnapshot accept two plugins: a CustomFileGrouper and
> a FileLocationResolver. Here's what they will look like:
> {code}
> /**
>  * If desired, you may implement a CustomFileGrouper in order to influence how
>  * ExportSnapshot chooses which input files go into the MapReduce job's
>  * {@link InputSplit}s. Your implementation must return a data structure that
>  * contains each input file exactly once. Files that appear in separate entries
>  * in the top-level returned Collection are guaranteed to not be placed in the
>  * same InputSplit.
>  * This can be used to segregate your input files by the rack or host on which
>  * they are available, which, used in conjunction with
>  * {@link FileLocationResolver}, can improve the performance of your
>  * ExportSnapshot runs.
>  * To use this, pass the --custom-file-grouper argument with the fully
>  * qualified class name of an implementation of CustomFileGrouper that's on
>  * the classpath.
>  * If this argument is not used, no particular grouping logic will be applied.
>  */
> public interface CustomFileGrouper {
>   Collection<Collection<Pair<SnapshotFileInfo, Long>>> getGroupedInputFiles(
>     final Collection<Pair<SnapshotFileInfo, Long>> snapshotFiles);
> }
> /**
>  * If desired, you may implement a FileLocationResolver in order to influence
>  * the _location_ metadata attached to each {@link InputSplit} that
>  * ExportSnapshot will submit to YARN. The
>  * {@link #getLocationsForInputFiles(Collection)} method is called once for
>  * each InputSplit being constructed. Whatever is returned will ultimately be
>  * reported by that split's {@link InputSplit#getLocations()} method. This can
>  * be used to encourage YARN to schedule the ExportSnapshot's mappers on
>  * rack-local or host-local NodeManagers.
>  * To use this, pass the --file-location-resolver argument with the fully
>  * qualified class name of an implementation of FileLocationResolver that's on
>  * the classpath.
>  * If this argument is not used, no locations will be attached to the
>  * InputSplits.
>  */
> public interface FileLocationResolver {
>   Set<String> getLocationsForInputFiles(
>     final Collection<Pair<SnapshotFileInfo, Long>> files);
> }
> {code}
> Users can optionally provide implementations of these interfaces on their
> classpath and tell ExportSnapshot to use them via the new
> --custom-file-grouper and --file-location-resolver options. By default,
> there will be no change in behavior. Users who implement these plugins can
> make ExportSnapshot topology-aware in a very flexible way. I plan to write
> my own plugins optimized for AWS pricing, but that won't be the only way
> this can be used.
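
To illustrate how the two proposed plugins might divide the work, here is a minimal, self-contained sketch. It is not the HBase implementation: SnapshotFile is a hypothetical stand-in for Pair<SnapshotFileInfo, Long>, the zoneByPath map is an assumed lookup from file path to an availability-zone label, and groupByZone and locationsFor only mirror the shape of getGroupedInputFiles and getLocationsForInputFiles. Which location strings YARN actually honors is not stated in the issue, so the labels here are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RackAwareExportSketch {

  /** Hypothetical stand-in for Pair<SnapshotFileInfo, Long>: a file path and its size. */
  static final class SnapshotFile {
    final String path;
    final long sizeBytes;

    SnapshotFile(String path, long sizeBytes) {
      this.path = path;
      this.sizeBytes = sizeBytes;
    }
  }

  /** Assumed lookup: files with no known zone fall into a single "unknown" bucket. */
  static String zoneOf(SnapshotFile file, Map<String, String> zoneByPath) {
    return zoneByPath.getOrDefault(file.path, "unknown");
  }

  /**
   * Grouper role: every input file lands in exactly one group, and files from
   * different zones never share a group (and so never share an InputSplit).
   */
  static Collection<Collection<SnapshotFile>> groupByZone(
      Collection<SnapshotFile> files, Map<String, String> zoneByPath) {
    Map<String, List<SnapshotFile>> groups = new LinkedHashMap<>();
    for (SnapshotFile file : files) {
      groups.computeIfAbsent(zoneOf(file, zoneByPath), z -> new ArrayList<>()).add(file);
    }
    return new ArrayList<Collection<SnapshotFile>>(groups.values());
  }

  /** Resolver role: the set of location labels for the files of one split. */
  static Set<String> locationsFor(
      Collection<SnapshotFile> files, Map<String, String> zoneByPath) {
    Set<String> locations = new TreeSet<>();
    for (SnapshotFile file : files) {
      locations.add(zoneOf(file, zoneByPath));
    }
    return locations;
  }

  public static void main(String[] args) {
    // Made-up paths and zone labels, purely for illustration.
    Map<String, String> zoneByPath = new HashMap<>();
    zoneByPath.put("/hbase/t1/f1", "zone-a");
    zoneByPath.put("/hbase/t1/f2", "zone-b");
    zoneByPath.put("/hbase/t1/f3", "zone-a");

    List<SnapshotFile> files = Arrays.asList(
        new SnapshotFile("/hbase/t1/f1", 100L),
        new SnapshotFile("/hbase/t1/f2", 200L),
        new SnapshotFile("/hbase/t1/f3", 300L));

    for (Collection<SnapshotFile> group : groupByZone(files, zoneByPath)) {
      System.out.println(locationsFor(group, zoneByPath) + " <- " + group.size() + " file(s)");
    }
  }
}
```

Because the grouper already separates files by zone, the resolver returns a single label for each group in this sketch. A real plugin would implement the two interfaces quoted above and be selected with --custom-file-grouper and --file-location-resolver.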