[ https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121247#comment-17121247 ]
Geoffrey Jacoby commented on HBASE-24440:
-----------------------------------------

[~apurtell] - in HBase 2.x and above, the sort-delete-before-put rule is configurable (see 29.3 in the HBase book). It can be disabled at the cost of some extra CPU on reads.

> Prevent temporal misordering on timescales smaller than one clock tick
> ----------------------------------------------------------------------
>
> Key: HBASE-24440
> URL: https://issues.apache.org/jira/browse/HBASE-24440
> Project: HBase
> Issue Type: Brainstorming
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> When mutations are sent to the servers without a timestamp explicitly
> assigned by the client, the server will substitute the current wall clock
> time. There are edge cases where it is at least theoretically possible for
> more than one mutation to be committed to a given row within the same clock
> tick. When this happens we have to track and preserve the ordering of these
> mutations in some other way besides the timestamp component of the key. Let
> me bypass most discussion here by noting that whether we do this or not, we
> do not pass such ordering information in the cross-cluster replication
> protocol. We also have interesting edge cases regarding key type precedence
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This,
> especially in the presence of replication, can lead to visible anomalies for
> clients able to interact with both source and sink.
>
> There is a simple solution that removes the possibility that these edge cases
> can occur:
>
> We can detect, when we are about to commit a mutation to a row, whether we
> have already committed a mutation to this same row in the current clock tick.
> Occurrences of this condition will be rare. We are already tracking current
> time; we have to know it in order to assign the timestamp. Where this
> becomes interesting is how we might track the last commit time per row.
>
> Making the detection of this case efficient for the normal code path is the
> bulk of the challenge. One option is to keep track of the last locked time
> for row locks. (Todo: How would we track and garbage collect this efficiently
> and correctly? Not the ideal option.) We might also do this tracking somehow
> via the memstore. (At least in this case the lifetime and distribution of
> in-memory row state, including the proposed timestamps, would align.) Assuming
> we can efficiently know if we are about to commit twice to the same row
> within a single clock tick, we would simply sleep/yield the current thread
> until the clock ticks over, and then proceed.
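The proposal in the last quoted paragraph can be sketched roughly as follows. This is a hypothetical illustration, not HBase code: `TickGuard`, `nextCommitTimestamp` and `evictBefore` are invented names, rows are plain `String` keys rather than HBase `byte[]` row keys, and the caller is assumed to already hold the row lock so that calls for one row never run concurrently. It keeps the last commit tick per row in a map (one possible answer to the tracking and garbage-collection question) and, on the rare same-tick collision, yields until the clock reads a later tick.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch (not HBase code): hands out commit timestamps so that
 * two commits to the same row never share a clock tick.
 * Assumes the caller holds the row lock, so calls for one row are serialized.
 */
public class TickGuard {
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private final ConcurrentHashMap<String, Long> lastCommitTick = new ConcurrentHashMap<>();

    public TickGuard(LongSupplier clock) {
        this.clock = clock;
    }

    /** Returns a timestamp for a commit to {@code row}, yielding until the clock
     *  reads a tick later than the previous commit to the same row. */
    public long nextCommitTimestamp(String row) {
        long now = clock.getAsLong();
        Long last = lastCommitTick.get(row);
        // Rare path: this row already committed in the current tick (or the clock stepped back).
        while (last != null && now <= last) {
            Thread.yield();
            now = clock.getAsLong();
        }
        lastCommitTick.put(row, now);
        return now;
    }

    /** Drops entries that can no longer force a wait: rows whose last commit
     *  happened in a tick strictly earlier than {@code currentTick}. */
    public void evictBefore(long currentTick) {
        lastCommitTick.values().removeIf(t -> t < currentTick);
    }

    /** Number of rows currently tracked. */
    public int trackedRows() {
        return lastCommitTick.size();
    }

    public static void main(String[] args) {
        // Fake clock: reads 100 for the first three calls, then advances by 1 per call.
        AtomicLong reads = new AtomicLong();
        LongSupplier fakeClock = () -> {
            long n = reads.getAndIncrement();
            return n < 3 ? 100L : 100L + (n - 2);
        };
        TickGuard guard = new TickGuard(fakeClock);
        System.out.println(guard.nextCommitTimestamp("row1")); // 100
        System.out.println(guard.nextCommitTimestamp("row2")); // 100 (different row, same tick is fine)
        System.out.println(guard.nextCommitTimestamp("row1")); // yields once, then 101
    }
}
```

An entry whose tick is older than the current clock reading can never force a wait, so a periodic `evictBefore(now)` keeps the map down to rows committed in the current tick. That bound only holds if the clock never steps backwards, which a wall clock can do; an evicted row could then be handed a timestamp it has already used.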