[ https://issues.apache.org/jira/browse/HBASE-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635960#comment-13635960 ]

Sergey Shelukhin edited comment on HBASE-3787 at 4/19/13 2:20 AM:
------------------------------------------------------------------

Well, I've spent today writing some code and then discussed it with [~enis]. 
There are a huge number of special cases all over the place...

First, on which nonces to use. Unfortunately, my proposed approach is not as 
simple as it seems, due to special cases :)
In case of hash nonces:
* you have an epic hashmap of all nonces seen over the past expiration interval (say, one hour)
* on the common successful-operation path, every op does a putIfAbsent into this map and later takes an uncontested lock to check for waiting conflicts (see the sketch right after this list)
* cleanup iterates the map and removes expired entries
* code is simple
* *retries after the expiration time will result in duplicate operations*
* increasing the expiration time is limited by the total (map) size
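
To make that common path concrete, here is a rough sketch of the hash-nonce bookkeeping; the class and method names are made up, and the wait-for-conflicting-operation locking is omitted:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch only: names are invented; conflict-wait locking is not shown.
public class HashNonceSketch {
  // one big map: nonce -> time we first saw it, kept for the expiration interval
  private final ConcurrentHashMap<Long, Long> nonces = new ConcurrentHashMap<Long, Long>();
  private final long expirationMs; // e.g. one hour

  public HashNonceSketch(long expirationMs) {
    this.expirationMs = expirationMs;
  }

  /** Common path: one putIfAbsent per op; false means this nonce was already seen (a retry). */
  public boolean startOperation(long nonce) {
    return nonces.putIfAbsent(nonce, System.currentTimeMillis()) == null;
  }

  /** Cleanup chore: iterate the map and remove expired entries. */
  public void cleanExpired() {
    long cutoff = System.currentTimeMillis() - expirationMs;
    for (Map.Entry<Long, Long> e : nonces.entrySet()) {
      if (e.getValue() < cutoff) {
        nonces.remove(e.getKey(), e.getValue());
      }
    }
  }
}
{code}
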
In case of sequential nonces:
* you have a small map keyed by client, with smaller nonce structures inside; it is easier to clean up, so probably much smaller in total
* however, on the common path you do a small hashmap lookup, a putIfAbsent into a smaller structure (say, another hashmap), take the same uncontested lock as in case 1, and do 2 interlocked ops
* cleanup is still very simple
* however, to keep cleanup and the main path simple, the code for starting/ending a nonce operation needs a lot of interlocked/volatile/etc. cleverness, which will almost never execute, to handle special cases
* *retries after expiration time #1 (per-nonce) will result in rejection - no dup* (the importance of which depends on the WAL design below)
* increasing expiration time #1 is limited by the total size
* *retries after expiration time #2 (per-client) will result in a duplicate change*
* increasing expiration time #2 is limited only by the number of *clients* (not nonces), so it's a much larger margin of safety (which may or may not be worth it)
Finally, to make full use of range collapsing as suggested above, you need to 
factor the region into the clientId on the client (not on the server, since 
clientId is opaque to the server), so that consecutive sequence numbers go to 
the same region.
That allows using a structure better than a hashmap for the nonces above (see 
the sketch below), reducing the memory footprint a lot and increasing the nonce 
expiration time.
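
For comparison, a hypothetical sketch of the per-client sequential-nonce structure with range collapsing; all names are invented, cleanup, the conflict lock, and the special-case cleverness mentioned above are left out, and nonces are assumed to start at 1:
{code:java}
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: not the actual patch.
public class SequentialNonceSketch {
  /** Per-client state: every nonce up to collapsedUpTo has already been seen. */
  private static class ClientNonces {
    private long collapsedUpTo = 0;                               // assumes nonces start at 1
    private final TreeSet<Long> outOfOrder = new TreeSet<Long>(); // sparse nonces above the watermark
    private long lastSeenTs = System.currentTimeMillis();         // for per-client expiration (#2), read by a cleanup chore (not shown)

    synchronized boolean start(long nonce) {
      lastSeenTs = System.currentTimeMillis();
      if (nonce <= collapsedUpTo || !outOfOrder.add(nonce)) {
        return false; // a retry of a nonce we already have
      }
      // range collapsing: fold consecutive nonces into the watermark
      while (!outOfOrder.isEmpty() && outOfOrder.first() == collapsedUpTo + 1) {
        collapsedUpTo = outOfOrder.pollFirst();
      }
      return true;
    }
  }

  // small map keyed by client, compact structures inside
  private final ConcurrentHashMap<Long, ClientNonces> clients =
      new ConcurrentHashMap<Long, ClientNonces>();

  public boolean startOperation(long clientId, long nonce) {
    ClientNonces c = clients.get(clientId);
    if (c == null) {
      ClientNonces fresh = new ClientNonces();
      ClientNonces prev = clients.putIfAbsent(clientId, fresh);
      c = (prev != null) ? prev : fresh;
    }
    return c.start(nonce);
  }
}
{code}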

h5. TL;DR1 I am not sure the complexity of the sequential nonces is worth it, 
at least for the first cut.

Then, there are some easy-to-handle special cases like splits and merges.

The main problem in either case is WAL recovery. First, we will have to read 
nonces from the entire WAL, not just the records we recover, because otherwise 
nonces won't work for records that got flushed (we don't recover the WAL below 
some watermark, since those records are already in the store).
Second, even if we do read the entire WAL, the WAL can go away very quickly 
after all records make it to the FS, so we won't have its nonces at all.
One option is to keep the WAL around for the nonce recovery period.
Alternatively, we can have a separate additional "log" file for nonces. It 
would just contain a bunch of numbers. Flushing it would not be on the main 
path - because the WAL itself also needs to contain the nonces (for replication, 
at least), we can flush the nonce log only before a memstore flush.
So when we recover, for a given nonce we will either see it in the WAL (if its 
edit was never flushed to disk, it will be part of recovery) or see it in the 
nonce log (or occasionally both, which doesn't matter).
The nonce log can be rolled independently and nuked once a file is at least 
expiration-time old.
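
To illustrate, the nonce-log writer could be about this dumb (this is not an existing HBase class or API, just a sketch):
{code:java}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Collection;

// Illustration only.
public class NonceLogSketch {
  private final DataOutputStream out; // e.g. a side file on the FS, rolled independently of the WAL

  public NonceLogSketch(DataOutputStream out) {
    this.out = out;
  }

  /**
   * Called just before a memstore flush, off the main write path (the WAL
   * already carries the nonces for anything not yet flushed).
   */
  public void appendNonces(Collection<Long> noncesSinceLastFlush) throws IOException {
    for (long nonce : noncesSinceLastFlush) {
      out.writeLong(nonce); // the file really is just a bunch of numbers
    }
    out.flush();
  }
}
{code}
On recovery, the server would union the nonces read from such files (covering the flushed edits) with the nonces found in the WAL being replayed; files older than the expiration interval can simply be deleted.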

A radically different solution that [~enis] proposed is to write increments 
and appends as separate markers (like delete markers), containing nonces as 
Cell tags or shadow columns, and to coalesce them on reads, and during 
compactions after some time.
When we coalesce them to get the final value, we throw away the duplicate ones.
This way we get rid of all the above complexity, because nonce management 
becomes just part of normal KV management.
However, it may introduce a lot of other special cases around out-of-order 
puts/deletes and the number of versions to keep (increments/appends will need 
special accounting to preserve version semantics). Plus, coalescing the value 
of some often-incremented field may be expensive.
It will also allow us to support out-of-order increments and appends! Just 
kidding.
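
To illustrate the coalescing, a toy sketch; the types here are made up, and real markers would be Cells carrying the nonce as a tag or shadow column:
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only.
public class IncrementCoalescingSketch {
  /** Stand-in for an increment/append marker KV. */
  public static class IncrementMarker {
    final long delta;
    final long nonce;
    public IncrementMarker(long delta, long nonce) { this.delta = delta; this.nonce = nonce; }
  }

  /** Coalesce the markers for one cell into its final value, dropping retried duplicates. */
  public static long coalesce(long baseValue, List<IncrementMarker> markers) {
    Set<Long> seenNonces = new HashSet<Long>();
    long value = baseValue;
    for (IncrementMarker m : markers) {
      if (seenNonces.add(m.nonce)) { // first occurrence of this nonce: apply it
        value += m.delta;
      }                              // duplicate nonce: an extra from a retry, throw it away
    }
    return value;
  }
}
{code}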

h5. TL;DR2 My current plan is as follows. I will wait until Mon-Tue for 
feedback while doing other things (or until enough feedback accumulates :)). 
Then I will stash my sequential-nonces code and do the simplest hash-nonces 
patch possible, including sending a summary of nonces to the server during WAL 
recovery from whatever WAL we are currently reading, including below the 
watermark, without actually replaying the KVs. It will not be bulletproof, but 
it's a first step if there are no objections :)
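
As a rough illustration of that recovery step; WalEntry here is a placeholder interface, not the real WAL reader API:
{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustration only: WalEntry is an invented placeholder.
public class NonceSummarySketch {
  public interface WalEntry {
    boolean hasNonce();
    long getNonce();
  }

  /** Collect every nonce in the WAL being read, including edits below the flush watermark. */
  public static Set<Long> collectNonces(Iterable<WalEntry> walEntries) {
    Set<Long> nonces = new HashSet<Long>();
    for (WalEntry entry : walEntries) {
      if (entry.hasNonce()) {
        nonces.add(entry.getNonce());
      }
    }
    return nonces; // shipped to the region server to seed its nonce map, without replaying the KVs
  }
}
{code}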

                
> Increment is non-idempotent but client retries RPC
> --------------------------------------------------
>
>                 Key: HBASE-3787
>                 URL: https://issues.apache.org/jira/browse/HBASE-3787
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.94.4, 0.95.2
>            Reporter: dhruba borthakur
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: HBASE-3787-partial.patch
>
>
> The HTable.increment() operation is non-idempotent. The client retries the 
> increment RPC a few times (as specified by configuration) before throwing an 
> error to the application. This makes it possible for the same increment call 
> to be applied twice at the server.
> For increment operations, would it be better to use 
> HConnectionManager.getRegionServerWithoutRetries()? Another option would be 
> to enhance the IPC module so that the RPC server correctly identifies whether 
> the RPC is a retry attempt and handles it accordingly.
