[ https://issues.apache.org/jira/browse/HBASE-20844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539597#comment-16539597 ]
ShivaKumar SS commented on HBASE-20844:
---------------------------------------

This behaviour is not seen in HBase 1.4.5. The cause turns out to be that the fix below, which skips regions that are being split, is missing from HBase 1.3.1.

{{Class : org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl}}

Method :

{{ public static List<HRegionInfo> getRegionInfosFromManifest(SnapshotManifest manifest) {}}
{{   List<SnapshotRegionManifest> regionManifests = manifest.getRegionManifests();}}
{{   if (regionManifests == null) {}}
{{     throw new IllegalArgumentException("Snapshot seems empty");}}
{{   }}}
{{   List<HRegionInfo> regionInfos = Lists.newArrayListWithCapacity(regionManifests.size());}}
{{   for (SnapshotRegionManifest regionManifest : regionManifests) {}}
{{     HRegionInfo hri = HRegionInfo.convert(regionManifest.getRegionInfo());}}
{{     if (hri.isOffline() && (hri.isSplit() || hri.isSplitParent())) { // This one.}}
{{       continue;}}
{{     }}}
{{     regionInfos.add(hri);}}
{{   }}}
{{   return regionInfos;}}
{{ }}}

> Duplicate rows returned while hbase snapshot reads
> --------------------------------------------------
>
>                 Key: HBASE-20844
>                 URL: https://issues.apache.org/jira/browse/HBASE-20844
>             Project: HBase
>          Issue Type: Bug
>          Components: mapreduce, snapshots, spark
>    Affects Versions: 1.3.1
>         Environment: Cluster Details
> Java 1.7
> HBase 1.3.1
> Spark 1.6.1
>            Reporter: ShivaKumar SS
>            Priority: Major
>
> We take a snapshot from code and read the data using MR and Spark;
> both approaches return duplicate records.
> On the API side,
> {{org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat}} is used.
> The snapshot was taken while the table was in a region split state.
> We suspect that data is being returned for both the parent and the daughter
> regions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
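
As a rough illustration of the check marked {{// This one.}} in the comment above, the following self-contained Java sketch applies the same rule to a stand-in type, so it runs without HBase on the classpath. The names {{Region}}, {{dropSplitParents}} and the sample region names are assumptions made up for this example; the real method works on {{HRegionInfo}} objects read from the snapshot manifest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitParentFilterSketch {

    /** Hypothetical stand-in for HRegionInfo; only the three flags used by the check are modelled. */
    static final class Region {
        final String name;
        final boolean offline;
        final boolean split;
        final boolean splitParent;

        Region(String name, boolean offline, boolean split, boolean splitParent) {
            this.name = name;
            this.offline = offline;
            this.split = split;
            this.splitParent = splitParent;
        }
    }

    /** Returns the regions to scan, skipping offline regions that are marked as split. */
    static List<Region> dropSplitParents(List<Region> regions) {
        List<Region> kept = new ArrayList<>(regions.size());
        for (Region r : regions) {
            // Same condition as in the quoted method: an offline split parent
            // is skipped, so its rows are read only through its daughters.
            if (r.offline && (r.split || r.splitParent)) {
                continue;
            }
            kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed snapshot contents: one parent caught mid-split plus its two daughters.
        List<Region> manifest = Arrays.asList(
                new Region("parent", true, true, true),
                new Region("daughterA", false, false, false),
                new Region("daughterB", false, false, false));

        for (Region r : dropSplitParents(manifest)) {
            System.out.println(r.name);
        }
    }
}
```

Without the check, all three regions would be scanned and the parent's rows would come back a second time through the daughters, which matches the duplicate records described in the issue.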