[ https://issues.apache.org/jira/browse/HBASE-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675743#comment-15675743 ]
Nick Dimiduk commented on HBASE-16820: -------------------------------------- Thanks for the explanation. Where are you guys with this one? We can at least see what QA says about the patch. Do we have a mixed online + bulkload ITBLL variant we can use to test it? > BulkLoad mvcc visibility only works accidentally > ------------------------------------------------- > > Key: HBASE-16820 > URL: https://issues.apache.org/jira/browse/HBASE-16820 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Attachments: HBASE-16820-branch-1.1-v0.patch > > > [~sergey.soldatov] has been debugging an issue with a 1.1 code base where the > commit for HBASE-16721 broke the bulk load visibility. After bulk load, the > bulk load files is not visible because the sequence id assigned to the bulk > load is not advanced in mvcc. > Debugging further, we have noticed that bulk load behavior is wrong, but it > works "accidentally" in all code bases (but broken in 1.1 after HBASE-16721). > Let me explain: > - BL request can optionally request a flush before hand (this should be the > default) which causes the flush to happen with some sequenceId. The flush > sequence id is one past all the cells' sequenceids. This flush sequence id is > returned as a result to the flush operation. > - BL then uses this particular sequenceId to mark the files, but itself does > not get a new sequenceid of its own, or advance the mvcc number. > - BL completes WITHOUT making sure that the sequence id is visible. > - BL itself though writes entries to the WAL for the BL event, which in 1.2 > code bases goes through the whole mvcc + seqId paths, which makes sure that > earlier sequenceIds (the flush sequenceId) are visible via mvcc. > The problem with 1.1 is that the WAL entries only get sequence ids, but do > not touch mvcc. With the patch for HBASE-16721, we have made it so that the > flushedSequenceId is not used in mvcc as the highest read point (although all > the data is still visible). > BL relying on the flush sequence id is wrong for two reasons: > - BL files are loaded with the flush sequence id from the memstore. This > particular sequence id is used twice for two different things and ends up > being the sequence id for flushed file as well as BL'ed files. > - BL should make sure that it gets a new sequence id and that sequence id is > visible before returning the results. > [~ndimiduk] FYI. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)