[jira] [Commented] (HBASE-15097) When the scan operation covered two regions,sometimes the final results have duplicated rows.

chenrongwei (JIRA) Wed, 13 Jan 2016 04:53:00 -0800

    [ https://issues.apache.org/jira/browse/HBASE-15097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096136#comment-15096136 ]


chenrongwei commented on HBASE-15097:
-------------------------------------

I think there may be a bug in the region-splitting process that leaves a 
region still holding data beyond its end key.
Here is my test code:
    import java.io.FileOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanTest {
        public static void main(String[] args) {
            Configuration configuration = HBaseConfiguration.create();
            TableName tableName = TableName.valueOf("xsearch_solr");
            byte[] family = Bytes.toBytes("cf");
            byte[] qualifier = Bytes.toBytes("userid");
            // Row that should appear exactly once in the scan results.
            String probeRow = "bff8f2d40000000000232958622";

            Scan scan = new Scan(Bytes.toBytes("bbf8f2d40000000000232958622"),
                    Bytes.toBytes("bff8f2d40000000000232958623"));
            scan.setCaching(4000);
            scan.setMaxVersions(1);
            scan.addColumn(family, qualifier);

            // try-with-resources closes the file, connection and table even on failure.
            try (FileOutputStream output = new FileOutputStream("rowkey.txt");
                 Connection connection = ConnectionFactory.createConnection(configuration);
                 Table theTestTable = connection.getTable(tableName)) {
                long beginTime = System.nanoTime();
                int hits = 0;
                try (ResultScanner resultScanner = theTestTable.getScanner(scan)) {
                    Result[] results = resultScanner.next(4000);
                    while (results != null && results.length > 0) {
                        for (Result aResult : results) {
                            // Record every returned row key so duplicates can be found later.
                            output.write(aResult.getRow());
                            output.write("\n".getBytes());
                            String row = Bytes.toString(aResult.getRow());
                            if (probeRow.equals(row)) {
                                System.out.println("rowid=" + row + ",timestamp="
                                        + aResult.getColumnLatestCell(family, qualifier)
                                                .getTimestamp());
                                hits++;
                            }
                        }
                        results = resultScanner.next(4000);
                    }
                }
                long endTime = System.nanoTime();
                System.out.println("query cost=" + (endTime - beginTime) + "ns"
                        + ",hits=" + hits);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

The code's output is as follows:

rowid=bff8f2d40000000000232958622,timestamp=1452223831551
rowid=bff8f2d40000000000232958622,timestamp=1452685378997
query cost=24923466628ns,hits=2

Please check the snapshot file for the table's current region info, and 
check rowkey.txt to find the duplicated rows, such as 
'bff8e36c0000000000244275031'.
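
To make the duplicates in rowkey.txt easier to spot, a small helper along 
these lines could list every row key that occurs more than once. This is only 
a sketch: the class and method names are made up for illustration, and it 
assumes one row key per line with no newline bytes inside a key.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase: lists row keys seen more than once.
final class DuplicateRowKeys {

    /** Returns each key that occurs at least twice, with its count, in first-seen order. */
    static Map<String, Integer> findDuplicates(Iterable<String> rowKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : rowKeys) {
            if (key.isEmpty()) {
                continue; // skip blank lines
            }
            counts.merge(key, 1, Integer::sum);
        }
        // Drop keys that were seen only once.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file written by the test code; assumed to be in the working directory.
        String path = args.length > 0 ? args[0] : "rowkey.txt";
        // ISO-8859-1 maps every byte to a character, so arbitrary key bytes do not fail decoding.
        Map<String, Integer> duplicates = findDuplicates(
                Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1));
        for (Map.Entry<String, Integer> entry : duplicates.entrySet()) {
            System.out.println(entry.getKey() + " x" + entry.getValue());
        }
        System.out.println("duplicated row keys: " + duplicates.size());
    }
}
```

Running it against rowkey.txt prints one line per duplicated key followed by 
the number of distinct keys that were returned more than once.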

I checked the trace log file 'output.log' and found the details of the scan 
operation, but I don't know why the region's HFile still keeps old data that 
lies beyond its end key.




> When the scan operation covered two regions,sometimes the final results have 
> duplicated rows.
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-15097
>                 URL: https://issues.apache.org/jira/browse/HBASE-15097
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.1.2
>         Environment: centos 6.5
> hbase 1.1.2 
>            Reporter: chenrongwei
>            Assignee: chenrongwei
>             Fix For: 1.1.2
>
>         Attachments: output.log, rowkey.txt, snapshot2016-01-13 pm 8.42.37.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When the scan operation's start key and end key cover two regions, the first 
> region returns rows that lie beyond its end key. This leads to duplicated 
> rows in the results.
> To avoid this problem, we should add a check before setting the variable 
> "stopRow" in the HRegion class, as follows:
>             if (Bytes.equals(scan.getStopRow(), HConstants.EMPTY_END_ROW)
>                     && !scan.isGetScan()) {
>                 this.stopRow = null;
>             } else {
>                 if (Bytes.compareTo(scan.getStopRow(),
>                         this.getRegionInfo().getEndKey()) >= 0) {
>                     this.stopRow = this.getRegionInfo().getEndKey();
>                 } else {
>                     this.stopRow = scan.getStopRow();
>                 }
>             }
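
The clamping idea in the quoted description can be tried out without a 
cluster. The sketch below is not the HRegion code: the class, the method names 
and the region end key are hypothetical, and it uses plain byte arrays with 
the unsigned lexicographic ordering that HBase applies to row keys. It also 
treats an empty key as "no bound" on both sides, since the last region of a 
table has an empty end key; that handling is an assumption of this sketch and 
goes beyond the snippet quoted above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone illustration of limiting a scan's stop row to one region.
final class StopRowClamp {

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns the stop row one region should honour.
     * An empty array means "no upper bound" (assumption of this sketch).
     */
    static byte[] effectiveStopRow(byte[] scanStopRow, byte[] regionEndKey) {
        if (regionEndKey.length == 0) {
            return scanStopRow;      // last region: only the scan's own limit applies
        }
        if (scanStopRow.length == 0) {
            return regionEndKey;     // open-ended scan: stop at the region boundary
        }
        // Never read past the region boundary, even if the scan asks for more.
        return compare(scanStopRow, regionEndKey) >= 0 ? regionEndKey : scanStopRow;
    }

    public static void main(String[] args) {
        // Hypothetical region end key, chosen only for illustration.
        byte[] regionEndKey = "bff8e".getBytes(StandardCharsets.US_ASCII);
        byte[] scanStopRow = "bff8f2d40000000000232958623".getBytes(StandardCharsets.US_ASCII);
        byte[] stop = effectiveStopRow(scanStopRow, regionEndKey);
        System.out.println("effective stop row: " + new String(stop, StandardCharsets.US_ASCII));
    }
}
```

With the made-up end key, the scan's stop row sorts after the region boundary, 
so the region would stop at its own end key and leave the remaining rows to 
the next region.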



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
