[ 
https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981048#action_12981048
 ] 

stack commented on HBASE-2856:
------------------------------

Just had a good conversation with Ryan.  We concluded that using the HLog 
sequence number is NOT a good idea, mostly for performance reasons.  Too many 
updates would be stuck waiting on the completion of edits that started before 
our update but have yet to complete.  (We do not want to return to the client 
until all transactions that started before ours, but are slower to run than 
ours, have completed; otherwise there is the danger of not being able to see 
what you have written.)  Instead, we need to keep a running sequence number 
that is per HRegion rather than per HRegionServer, as the HLog sequence number 
is.  This new HRegion sequence number is very much like the HLog sequence 
number in that on open of an HRegion we read in the largest and then increment 
from there.
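
A minimal sketch of such a per-region counter, to make the "read in the 
largest and then increment from there" idea concrete.  The class name, the 
constructor argument, and where the persisted ids come from are all 
assumptions made for illustration; this is not HBase code.

```java
import java.util.Collection;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-region sequence counter; one instance per open region.
final class RegionSequenceSketch {
    private final AtomicLong sequence;

    // persistedSequenceIds: whatever ids were recorded for this region before
    // it was closed (assumed input; the source does not say where they live).
    RegionSequenceSketch(Collection<Long> persistedSequenceIds) {
        long largest = 0L;
        for (long id : persistedSequenceIds) {
            largest = Math.max(largest, id);
        }
        this.sequence = new AtomicLong(largest);
    }

    // Hands out the next id; ids from different regions are unrelated.
    long next() {
        return sequence.incrementAndGet();
    }
}
```

Because each region owns its counter, a slow edit on one region never holds 
a lower number than a fast edit on another region.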

Let me try and explain how we arrived at this notion.

We do ACID (that is, prevent readers from reading part of an update) by only 
letting clients (scanners and gets) read stuff that has been fully committed.  
Currently we do this by moving forward a monotonically increasing 'read 
point'.  Each update is given a write point.  The read point is moved forward 
to encompass all completed write points, or 'transactions'.  Transactions 
complete willy-nilly, but the read point will not move beyond an incomplete one.
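
The read point rule can be sketched as below.  This is a single-threaded toy 
with hypothetical names, not the RWCC class itself; it only shows that the 
read point advances over a contiguous run of completed write points and stops 
at the first incomplete one.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, single-threaded illustration of read point vs. write points.
final class ReadPointSketch {
    // Write points that have completed but are not yet covered by readPoint.
    private final Set<Long> completedAhead = new HashSet<>();
    private long lastWritePoint = 0;
    private long readPoint = 0;

    // Each update is given the next write point.
    long beginWrite() {
        return ++lastWritePoint;
    }

    // Mark a write point complete, then advance the read point as far as the
    // completed write points are contiguous.
    long completeWrite(long writePoint) {
        completedAhead.add(writePoint);
        while (completedAhead.remove(readPoint + 1)) {
            readPoint++;
        }
        return readPoint;
    }

    long readPoint() {
        return readPoint;
    }
}
```

If write points 1, 2 and 3 are handed out and 2 and 3 complete first, the 
read point stays at 0; once 1 completes it jumps straight to 3.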

Here are the coarse steps involved in a 'transaction':

{code}
(0) row lock (Put, Increment, etc.)
(1) Go to WAL
(2) get new sequence id
(3) actually write WAL
(4) update memstore
(5) wait for our edit to be visible
(6) commit/move forward the read point 
(7) undo rowlock
{code}

Up to now, the way we did 'ACID' was around the memstore only.  The read point 
is kept inside an instance of RWCC.  A RWCC instance is Region scoped (one is 
created on creation of an HRegion).  A new write point is created when we go 
to write the memstore in step (4) above, and the read point is then moved 
forward to match the write point just before we do step (7).  Currently our 
RWCC transaction spans roughly steps (4) to (7).

"Wait to be visible" in the above means wait until all transactions that have 
an id that is less than mine complete before I proceed to update the read point 
and return to the client. A transaction that started before us may not complete 
until after ours because of thread scheduling, hiccups, etc.  We do not want to 
move the read point forward until all updates previous to ours have completed 
else we'll be letting clients read the incomplete earlier transactions.
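
A sketch of the "wait to be visible" step, under the assumption that pending 
transactions are kept in a queue ordered by write point.  All names here are 
hypothetical and the real RWCC may be structured differently; the point is 
only that a writer blocks until the read point has reached its own write 
point.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical thread-safe illustration of "wait for our edit to be visible".
final class VisibilityWaitSketch {
    static final class WriteEntry {
        final long writePoint;
        private boolean completed;
        WriteEntry(long writePoint) { this.writePoint = writePoint; }
    }

    // Pending transactions, oldest (lowest write point) first.
    private final Deque<WriteEntry> pending = new ArrayDeque<>();
    private long nextWritePoint = 1;
    private long readPoint = 0;

    synchronized WriteEntry begin() {
        WriteEntry entry = new WriteEntry(nextWritePoint++);
        pending.addLast(entry);
        return entry;
    }

    // Mark our transaction complete, advance the read point over every
    // completed transaction at the head of the queue, then block until the
    // read point covers our own write point.
    synchronized void completeAndWaitUntilVisible(WriteEntry entry) {
        entry.completed = true;
        while (!pending.isEmpty() && pending.peekFirst().completed) {
            readPoint = pending.removeFirst().writePoint;
        }
        notifyAll();
        boolean interrupted = false;
        while (readPoint < entry.writePoint) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // keep waiting; restore the flag afterwards
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized long readPoint() {
        return readPoint;
    }
}
```

A writer that finishes early simply parks in the wait loop until the slower, 
earlier writers call completeAndWaitUntilVisible themselves.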

Of note in the above, how long the WAL takes is not part of a RWCC transaction.

If we move to using HLog sequence numbers, the transaction now starts at step 
(1), when we go to the WAL.  We'll need to update the write point in RWCC at 
step (1).  The HLog sequence number is for the whole region server; it's not 
just HRegion scoped.  The 'wait for our edit to be visible' will now depend on 
the completion of edits against unrelated HRegions whose character may be 
completely different.  For example, the schema on HRegion A may be for 
increments whereas the schema on HRegion B may be for fat batches of cells.  
If both are on the same regionserver, the 'wait for our edit to be visible' 
may have the increments waiting on the completion of a fat batch of updates.
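
The difference in scope can be illustrated with a small simulation.  It is a 
toy model with hypothetical names, not a measurement: edits are listed in the 
order they started, and we count how many become visible when ordering is 
server-wide versus per region.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model: which edits are visible under each numbering scope?
final class SequenceScopeSketch {
    static final class Edit {
        final String region;
        final boolean completed;
        Edit(String region, boolean completed) {
            this.region = region;
            this.completed = completed;
        }
    }

    // Server-wide numbering: visibility stops at the first incomplete edit,
    // whichever region it belongs to.
    static int serverWideVisible(List<Edit> editsInStartOrder) {
        int visible = 0;
        for (Edit e : editsInStartOrder) {
            if (!e.completed) {
                break;
            }
            visible++;
        }
        return visible;
    }

    // Per-region numbering: an incomplete edit only stalls its own region.
    static Map<String, Integer> perRegionVisible(List<Edit> editsInStartOrder) {
        Map<String, Integer> visible = new TreeMap<>();
        Set<String> stalled = new HashSet<>();
        for (Edit e : editsInStartOrder) {
            visible.putIfAbsent(e.region, 0);
            if (stalled.contains(e.region)) {
                continue;
            }
            if (e.completed) {
                visible.merge(e.region, 1, Integer::sum);
            } else {
                stalled.add(e.region);
            }
        }
        return visible;
    }
}
```

With one unfinished fat batch on region B that started first, followed by two 
finished increments on region A, the server-wide count is 0 while the 
per-region counts are A=2 and B=0.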

So, the thought is instead to have a per-region sequence number, with the write 
point updated only after we emerge from the WAL append.  We keep the current 
'transaction' scope, which spans steps (4) to (7) above.

I'm going to go implement the per-region edit number unless an alternative is 
suggested.

> TestAcidGuarantee broken on trunk 
> ----------------------------------
>
>                 Key: HBASE-2856
>                 URL: https://issues.apache.org/jira/browse/HBASE-2856
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.89.20100621
>            Reporter: ryan rawson
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: 2856-v2.txt, 2856-v3.txt, acid.txt
>
>
> TestAcidGuarantee has a test whereby it attempts to read a number of columns 
> from a row, and every so often the first column of N is different, when it 
> should be the same.  This is a bug deep inside the scanner whereby the first 
> peek() of a row is done at time T then the rest of the read is done at T+1 
> after a flush, thus the memstoreTS data is lost, and previously 'uncommitted' 
> data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent 
> value) to the HFile thus allowing us to preserve read consistency past 
> flushes.  Another solution involves fixing the scanners so that peek() is not 
> destructive (and thus might return different things at different times alas).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
