Xu Cang created HBASE-24028:
-------------------------------
Summary: MapReduce on snapshot restores and opens all regions in
each mapper
Key: HBASE-24028
URL: https://issues.apache.org/jira/browse/HBASE-24028
Project: HBase
Issue Type: Bug
Affects Versions: 1.6.0, 2.3.0
Reporter: Xu Cang
Consider this scenario: one MR job scans a table with many regions. Each mapper
uses 'RestoreSnapshotHelper', which restores the snapshot for all regions.
In the code
[https://github.com/apache/hbase/blob/branch-2.0/hbase-server/src/main/java/org/apache/hadoop/hbase/snapshot/RestoreSnapshotHelper.java#L183]
there seems to be no way to restore only the regions relevant to a given mapper
from the snapshot. This leads to extreme slowness and wasted resources.
Please correct me if I am wrong or missing anything. Thanks.
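To make the expected behavior concrete, here is a minimal sketch in plain Java.
It is not HBase code: the class RegionFilterSketch and the method
regionsToRestore are hypothetical names used only for illustration, and region
names are modeled as plain strings. It assumes each mapper knows the encoded
name of the region behind its input split and would restore only that region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HBase.
public class RegionFilterSketch {

    // Returns the subset of snapshot regions that a single mapper would need,
    // assuming the mapper's split maps to exactly one encoded region name.
    static List<String> regionsToRestore(List<String> snapshotRegions, String splitRegion) {
        List<String> selected = new ArrayList<>();
        for (String region : snapshotRegions) {
            if (region.equals(splitRegion)) {
                selected.add(region);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Encoded region names taken from the log output in this report.
        List<String> snapshotRegions = Arrays.asList(
                "d7f85b4a9d3fa22a5e7b88bda39f6d50",
                "69dd3fdba3698f827f8883ed911161ef");

        // One mapper per region: each would restore one region instead of all.
        for (String splitRegion : snapshotRegions) {
            System.out.println("mapper for " + splitRegion + " restores "
                    + regionsToRestore(snapshotRegions, splitRegion));
        }
    }
}
```

With a filter of this kind, the restore work per mapper would stay constant
instead of growing with the number of regions in the table.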
One quick example is shown below. In my test, the test table has 2 regions,
and each mapper opens and iterates over both regions.
2020-03-19 18:58:15,225 INFO [main] mapred.MapTask - Map output collector class
= org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2020-03-19 18:58:15,285 INFO [main] snapshot.RestoreSnapshotHelper - region to
add: *d7f85b4a9d3fa22a5e7b88bda39f6d50*
2020-03-19 18:58:15,285 INFO [main] snapshot.RestoreSnapshotHelper - region to
add: *69dd3fdba3698f827f8883ed911161ef*
2020-03-19 18:58:15,286 INFO [main] snapshot.RestoreSnapshotHelper - clone
region=d7f85b4a9d3fa22a5e7b88bda39f6d50 as d7f85b4a9d3fa22a5e7b88bda39f6d50
If I have misunderstood anything, can anyone point me to where in this class
the regions are distinguished for different mappers?
By the way, the original implementation of MR on snapshots is HBASE-8369, and
there have not been many big changes since then.