[ https://issues.apache.org/jira/browse/HBASE-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635960#comment-13635960 ]
Sergey Shelukhin edited comment on HBASE-3787 at 4/19/13 2:20 AM:
------------------------------------------------------------------

Well, I've spent today writing some code, and then we discussed it with [~enis]. There is a huge number of special cases all over the place...

First, on which nonces to use. Unfortunately my proposed approach is not as simple as it seems, due to special cases :)

In case of hash nonces:
* you have an epic hashmap of all nonces over the past expiration interval (say, one hour) - see the sketch further below
* on the common successful operation path, for every op you do a putIfAbsent into this map, and later take an uncontested lock to check for waiting conflicts
* cleanup iterates the map and removes expired entries
* code is simple
* *retries later than the expiration time will result in duplicate operations*
* increasing the expiration time is limited by total (map) size

In case of sequential nonces:
* you have a small map by client, with smaller nonce structures inside; easier to clean up, so probably much smaller in total
* however, on the common path, you do a small hashmap lookup, a putIfAbsent into one smaller structure (say another hashmap), take the same uncontested lock as in case 1, and do 2 interlocked ops
* cleanup is still very simple
* however, to keep cleanup and the main path simple, the code for starting/ending a nonce operation has to have a lot of interlocked/volatile/etc. cleverness, which will almost never execute, to handle special cases
* *retries later than expiration time #1 will result in rejection - no dup, /importance of which depends on the WAL design below/*
* increasing expiration time #1 is limited by total size
* *retries later than expiration time #2 will result in a duplicate change*
* increasing expiration time #2 is only limited by the number of *clients* (not nonces), so it's a much larger margin of safety (which may or may not be worth it)

Finally, to make full use of range collapsing as suggested above, you need to factor the region into the clientId on the client (not on the server; to the server, clientId is opaque), so that a sequence of consecutive numbers goes to the same region. That would allow using a structure better than a hashmap for the nonces above, reduce the memory footprint a lot, and increase the nonce expiration time.

h5. TL;DR1
I am not sure the complexity of the sequential nonces is worth it, at least for the first cut.

Then, there are some easy-to-handle special cases like splits and merges.

The main problem in either case is WAL recovery. First, we would have to read nonces from the entire WAL, not just the records we recover, because otherwise nonces won't work for records that got flushed (we don't recover the WAL below some watermark, since those records are already in the store). Second, even if we do read records from the entire WAL, the WAL can go away very quickly after all records make it to the FS, so we won't have its nonces at all. One option is to keep the WAL around for the nonce recovery period. Alternatively, we can have a separate additional "log" file for nonces. It would just contain a bunch of numbers. Flushing it would not be on the main path - because the WAL itself also needs to contain nonces (for replication at least), we can flush the nonce log only before a memstore flush. So when we recover, for a given nonce we will either see it in the WAL (if it was never flushed to disk, it will be part of recovery), or we'll see it in the nonce log (or occasionally both, which doesn't matter). The nonce log can be rolled independently and nuked once a file is at least expiration-time old.
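For illustration, here is a minimal sketch of the hash-nonce bookkeeping from the first list above, assuming a 64-bit client-generated nonce per operation. The class and method names are made up for the sketch, not existing HBase code, and the real version would also need the lock / waiting-conflict handling mentioned in the list.

{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: tracks all nonces seen over the past expiration interval.
public class HashNonceSketch {
  private static final long EXPIRATION_MS = 60 * 60 * 1000L; // say, one hour
  // nonce -> wall-clock time the operation was first seen
  private final ConcurrentHashMap<Long, Long> seenNonces = new ConcurrentHashMap<>();

  /** Common path: one putIfAbsent per op; returns false for a duplicate retry. */
  public boolean startOperation(long nonce) {
    // The real version would additionally take an (uncontested) lock here to
    // check for waiting conflicts, and mark the operation finished later.
    return seenNonces.putIfAbsent(nonce, System.currentTimeMillis()) == null;
  }

  /** Cleanup: iterate the map and remove expired entries. */
  public void cleanup() {
    long cutoff = System.currentTimeMillis() - EXPIRATION_MS;
    seenNonces.entrySet().removeIf(e -> e.getValue() < cutoff);
  }
}
{code}

Note the caveat from the list above: once cleanup() drops a nonce, a retry arriving after the expiration interval is indistinguishable from a new operation, i.e. a duplicate.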
A radically different solution that [~enis] proposed is to output increments and appends as separate markers (like delete markers), containing nonces as Cell tags or shadow columns, and coalesce them on reads and, after some time, during compactions. When we coalesce them to get the final value, we would throw away the extra ones. This gets rid of all the above complexity, because nonce management becomes just part of normal KV management. However, we may introduce a lot of other special cases around out-of-order puts/deletes and the number of versions to keep (increments/appends would need special accounting to keep version semantics). Plus, coalescing the value of some often-incremented field may be expensive. It would also allow us to support out-of-order increments and appends! Just kidding.

h5. TL;DR2
My current plan is as follows. I will wait until Mon-Tue for feedback, doing other things (or until enough feedback accumulates :)). Then, I will stash my sequential nonces code and do the simplest hash nonces patch possible, including sending a summary of nonces to the server during WAL recovery from whatever WAL we are currently reading, including below the watermark, without actually replaying the KVs. It will not be bulletproof, but a first step if there are no objections :)
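To make the TL;DR2 part concrete, here is a hedged sketch of collecting a nonce summary while reading a WAL during recovery, without replaying the KVs. WalEntry and getNonce() are hypothetical placeholders for this sketch, not the actual HBase WAL reader API.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Sketch only: build a summary of nonces from an entire WAL, including edits
// below the flush watermark, so the region server can reject duplicate retries.
public class NonceSummaryCollectorSketch {
  public static final long NO_NONCE = 0L; // assumed sentinel for "no nonce attached"

  /** Hypothetical minimal view of a WAL entry carrying a nonce. */
  public interface WalEntry {
    long getNonce();
  }

  /** Collects the distinct nonces from a stream of WAL entries; KVs are not replayed. */
  public Set<Long> collect(Iterable<WalEntry> entries) {
    Set<Long> nonces = new HashSet<>();
    for (WalEntry entry : entries) {
      long nonce = entry.getNonce();
      if (nonce != NO_NONCE) {
        nonces.add(nonce);
      }
    }
    return nonces; // would be shipped to the recovering region server
  }
}
{code}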
> Increment is non-idempotent but client retries RPC
> --------------------------------------------------
>
>                 Key: HBASE-3787
>                 URL: https://issues.apache.org/jira/browse/HBASE-3787
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.94.4, 0.95.2
>            Reporter: dhruba borthakur
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: HBASE-3787-partial.patch
>
>
> The HTable.increment() operation is non-idempotent. The client retries the increment RPC a few times (as specified by configuration) before throwing an error to the application. This makes it possible that the same increment call is applied twice at the server.
> For increment operations, is it better to use HConnectionManager.getRegionServerWithoutRetries()? Another option would be to enhance the IPC module to make the RPC server correctly identify if the RPC is a retry attempt and handle it accordingly.