[ https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981048#action_12981048 ]
stack commented on HBASE-2856:
------------------------------

Just had a good conversation with Ryan. We conclude that using the HLog sequence number is NOT a good idea, mostly for performance reasons. Too many updates will be stuck waiting on the completion of edits that may have started before our update but that have yet to complete (we do not want to return to the client until all transactions started before ours -- but that are slower than ours to run -- have completed, else there is the danger of not being able to see what you have written).

Instead, we need to keep a running sequence number that is per HRegion rather than per HRegionServer as the HLog sequence number is. This new HRegion sequence number is very much like the HLog sequence number in that on open of an HRegion we read in the largest and then increment from there.

Let me try and explain how we arrived at this notion.

We do ACID -- prevent readers reading part of an update -- by only letting clients (scanners and gets) read stuff that has been fully committed. Currently we do this by moving forward a monotonically increasing 'read point'. Each update is given a write point. The read point is moved forward to encompass all completed write points or 'transactions'. Transactions complete willy-nilly but the read point will not move beyond the incomplete.

Here are the coarse steps involved in a 'transaction':

{code}
(0) row lock (Put, Increment, etc.)
(1) Go to WAL
(2) get new sequence id
(3) actually write WAL
(4) update memstore
(5) wait for our edit to be visible
(6) commit/move forward the read point
(7) undo rowlock
{code}

Up to now, the way we did 'ACID' was around the memstore only. The read point is kept inside an instance of RWCC. An RWCC instance is Region scoped (one is created on creation of an HRegion). A new write point is created when we go to write the memstore in step (4) above, and then the read point is moved forward to match the write point just before we do step (7) in the above.
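To make the read point / write point bookkeeping above concrete, here is a rough Java sketch. It is not the actual RWCC class: the names SimpleRwcc, WriteEntry, beginWrite, markCompleted and waitUntilVisible are made up for illustration, and the whole thing assumes a single lock is good enough. It only shows the rule that the read point never moves past an incomplete write.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of region-scoped read/write points. Not HBase code. */
class SimpleRwcc {
    /** One in-flight update; writeNumber is its write point. */
    static final class WriteEntry {
        final long writeNumber;
        boolean completed; // only touched while holding the SimpleRwcc lock
        WriteEntry(long writeNumber) { this.writeNumber = writeNumber; }
    }

    private long nextWriteNumber;
    private long readPoint;
    // In-flight writes, ordered by write number (oldest first).
    private final Deque<WriteEntry> pending = new ArrayDeque<>();

    SimpleRwcc(long startPoint) {
        this.readPoint = startPoint;
        this.nextWriteNumber = startPoint;
    }

    /** Roughly step (4): hand out a new write point before touching the memstore. */
    synchronized WriteEntry beginWrite() {
        WriteEntry e = new WriteEntry(++nextWriteNumber);
        pending.addLast(e);
        return e;
    }

    /**
     * Marks the write complete and moves the read point over every
     * contiguous completed write at the head of the queue.
     * Returns true if this write is now visible to readers.
     */
    synchronized boolean markCompleted(WriteEntry e) {
        e.completed = true;
        while (!pending.isEmpty() && pending.peekFirst().completed) {
            readPoint = pending.removeFirst().writeNumber;
        }
        notifyAll(); // wake writers blocked in waitUntilVisible
        return readPoint >= e.writeNumber;
    }

    /** Roughly step (5): block until all earlier writes have completed too. */
    synchronized void waitUntilVisible(WriteEntry e) throws InterruptedException {
        while (readPoint < e.writeNumber) {
            wait();
        }
    }

    /** Readers only see edits whose write number is at or below this value. */
    synchronized long getReadPoint() {
        return readPoint;
    }
}
```

In this model, if write 2 finishes before write 1, markCompleted for write 2 leaves the read point where it was; only when write 1 completes does the read point jump over both.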
Currently our RWCC transaction spans steps (4) to (7), roughly.

"Wait to be visible" in the above means wait until all transactions that have an id that is less than mine complete before I proceed to update the read point and return to the client. A transaction that started before us may not complete until after ours because of thread scheduling, hiccups, etc. We do not want to move the read point forward until all updates previous to ours have completed, else we'll be letting clients read the incomplete earlier transactions.

Of note in the above, how long the WAL takes is not part of an RWCC transaction.

If we move to using HLog sequence numbers, the transaction now starts at step (1), when we go to the WAL. We'll need to update the write point in RWCC at step (1). The HLog sequence number is for all of the region server; it's not just HRegion scoped. The 'wait for our edit to be visible' will now depend on the completion of edits against unrelated HRegions whose character may be completely different (e.g. the schema on HRegion A may be for increments whereas the schema on HRegion B may be for fat batches of cells. If both are on the same regionserver, the 'wait for our edit to be visible' may have the increments waiting on the completion of a fat batch of updates).

So, the thought is instead to have a per region sequence number, with the write point updated only after we emerge from the WAL append. We keep the current 'transaction' scope, where scope is between steps (4) and (7) in the above.

I'm going to go implement the per region edit number unless an alternative is suggested.
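A toy Java model of the concern above, for illustration only. EditTracker and its methods are hypothetical names, and "must wait" is modelled simply as "some smaller edit number handed out by the same tracker is still in flight"; the real code paths are more involved. Sharing one tracker between two regions stands in for the region-server-wide HLog sequence number; one tracker per region stands in for the proposed per region number, seeded from the largest number read on region open.

```java
import java.util.TreeSet;

/** Hypothetical model: hands out edit numbers and tracks which are still in flight. */
class EditTracker {
    private long lastNumber;
    private final TreeSet<Long> inFlight = new TreeSet<>();

    /** Seeded with the largest edit number found when the region (or server) opens. */
    EditTracker(long largestSeenOnOpen) {
        this.lastNumber = largestSeenOnOpen;
    }

    /** Start an edit: take the next number and remember it as incomplete. */
    synchronized long begin() {
        long n = ++lastNumber;
        inFlight.add(n);
        return n;
    }

    /** Finish an edit. */
    synchronized void complete(long n) {
        inFlight.remove(n);
    }

    /** An edit has to wait while any strictly smaller number is still in flight. */
    synchronized boolean mustWait(long n) {
        return !inFlight.headSet(n).isEmpty();
    }
}
```

With one shared tracker, a fat batch against HRegion B that began first makes a later increment against HRegion A wait. With a tracker per region, the increment on A does not see B's batch at all.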
> TestAcidGuarantee broken on trunk
> ----------------------------------
>
>                 Key: HBASE-2856
>                 URL: https://issues.apache.org/jira/browse/HBASE-2856
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.89.20100621
>            Reporter: ryan rawson
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: 2856-v2.txt, 2856-v3.txt, acid.txt
>
>
> TestAcidGuarantee has a test whereby it attempts to read a number of columns
> from a row, and every so often the first column of N is different, when it
> should be the same. This is a bug deep inside the scanner whereby the first
> peek() of a row is done at time T then the rest of the read is done at T+1
> after a flush, thus the memstoreTS data is lost, and previously 'uncommitted'
> data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent
> value) to the HFile thus allowing us to preserve read consistency past
> flushes. Another solution involves fixing the scanners so that peek() is not
> destructive (and thus might return different things at different times alas).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.