[
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chinmay Kulkarni updated PHOENIX-6081:
--------------------------------------
Affects Version/s: (was: master)
5.0.0
> Improvements to snapshot based MR input format
> ----------------------------------------------
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
> Issue Type: Improvement
> Components: core
> Affects Versions: 5.0.0, 4.14.3
> Reporter: Bharath Vissapragada
> Priority: Major
>
> Recently we switched an MR application from scanning live tables to scanning
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which
> turned out to be a correctness issue caused by the generation of overlapping
> scan splits. After some debugging we found that it had already been fixed by
> PHOENIX-4997. Even with that fix, there are quite a few things we could
> improve about the snapshot-based input format. They are listed here; perhaps
> we can break them into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot
> once per map task into a temp directory. For large tables on big clusters,
> this creates a storm of NameNode (NN) RPCs. We can restore once per job and
> let all the map tasks operate on the same restored snapshot. HBase already
> did this in HBASE-18806; we can do something similar.
> - Disable
> [cacheBlocks|https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]
> on scans generated by the input format. In our experiments the block cache
> took a lot of memory for MR jobs. For full table scans the cache isn't of
> much use, so disabling it can save a lot of memory.
> - Short-circuit live-table codepaths when snapshots are enabled. Currently
> some codepaths make live-table HBase RPCs to get a bunch of data. For example:
> {noformat}
> private List<InputSplit> generateSplits(final QueryPlan qplan, Configuration config)
>         throws IOException {
>     // We must call this in order to initialize the scans and splits from the query plan
>     ....
>     // Get the RegionSizeCalculator
>     try (org.apache.hadoop.hbase.client.Connection connection =
>             HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
>         RegionLocator regionLocator =
>                 connection.getRegionLocator(TableName.valueOf(tableName));
>         RegionSizeCalculator sizeCalculator =
>                 new RegionSizeCalculator(regionLocator, connection.getAdmin());
> {noformat}
> This defeats the purpose of using snapshots. Refactor the code so that the
> snapshot-based codepaths make minimal HBase RPCs and rely solely on the
> snapshot manifest. Even better, improve the locality of task scheduling based
> on the snapshot's HFile block locations.
> - Disable indexes in the query plan when scanning over snapshots. If there is
> an index-based access path, getScans() can potentially return index-based
> splits, which is not what we want for snapshots.
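
The locality idea in the third item (scheduling tasks based on the snapshot's HFile block locations) can be sketched in plain Java. This is only an illustration under stated assumptions, not Phoenix or HBase code: the class SnapshotSplitLocality, its Block type, and the bestHosts method are hypothetical names, and the block lengths and host lists are assumed to have been read already from the filesystem for the HFiles referenced by the snapshot manifest. The sketch ranks hosts by how many bytes of a split's blocks they store, which is the kind of list an InputSplit could return from getLocations().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Phoenix or HBase): ranks hosts by how many
 * bytes of a split's HFile blocks they hold locally.
 */
final class SnapshotSplitLocality {

    /** One block of an HFile: its length and the hosts that hold a replica. */
    static final class Block {
        final long length;
        final List<String> hosts;

        Block(long length, List<String> hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Returns up to maxHosts host names, ordered by local bytes, descending. */
    static List<String> bestHosts(List<Block> blocks, int maxHosts) {
        // Sum the block lengths stored on each host.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                bytesPerHost.merge(host, block.length, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(bytesPerHost.entrySet());
        // Most local bytes first; ties are broken by host name so the order is stable.
        ranked.sort((a, b) -> {
            int bySize = Long.compare(b.getValue(), a.getValue());
            return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < ranked.size() && i < maxHosts; i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up block layout, for illustration only.
        List<Block> blocks = Arrays.asList(
                new Block(128L, Arrays.asList("host-a", "host-b")),
                new Block(128L, Arrays.asList("host-a", "host-c")),
                new Block(64L, Arrays.asList("host-c", "host-b")));
        System.out.println(bestHosts(blocks, 2)); // prints [host-a, host-b]
    }
}
```

Because the ranking needs only file block metadata, a snapshot-based split could compute its preferred hosts without any RPC to a live region server.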