[ 
https://issues.apache.org/jira/browse/HBASE-25913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352181#comment-17352181
 ] 

Andrew Kyle Purtell edited comment on HBASE-25913 at 5/26/21, 11:22 PM:
------------------------------------------------------------------------

Once HBASE-25911 is committed I can open a PR for this one. Dropping a WIP 
patch here for now. This is the essence of the change:
{code:java}
+    @Override
+    public long currentTimeAdvancing() throws InterruptedException {
+      return lastTick.updateAndGet(x -> {
+        long now = System.currentTimeMillis();
+        if (now <= x) {
+          // System clock hasn't moved forward. Increment our notion of
+          // current tick to keep the time advancing.
+          return x + 1;
+        }
+        // System clock has moved ahead of our notion of current tick. Move us
+        // forward to match.
+        return now;
+      });
+    }
{code}

HRegion allocates a clock and uses {{currentTimeAdvancing}} when processing a 
mutation. {{currentTimeAdvancing}} calls {{System.currentTimeMillis}} and will 
typically just pass the returned value back, first updating a local atomic to 
track what was just returned. (Strictly speaking, we update the local atomic 
and return that.) The typical case is expected to be that the system clock is 
ahead of our tracked last returned value; I will instrument to determine the 
percentage of calls where this is true. If for some reason the clock has not 
advanced, we increment the local atomic and return that, until system time has 
moved ahead of our notion of the current "millis". {{AtomicLong#updateAndGet}} 
executes our update function atomically.

The overhead introduced here is only the new atomic. I think this is about as 
small as we can get it, but I leave open the possibility of alternate 
implementations and even include an alternate that spin waits (although, still, 
we need an atomic) that I will use to compare overheads while 
microbenchmarking. 

One concern: once we keep a local notion of time per region, suppose a huge 
burst of calls to get the time pushes us ahead of the system clock by more than 
a handful of increments. How long will we drift ahead of real time before the 
system clock is finally ahead of us again? I think this "poor man's hybrid 
clock" scheme is probably fine, but I will think it through, especially when 
writing unit tests.
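To make the drift question concrete (again with hypothetical names, and a mutable field standing in for {{System.currentTimeMillis}} so the burst can be simulated exactly): a burst of N calls within one millisecond tick leaves us N-1 ms ahead, and we keep incrementing until the system clock strictly passes our notion of time.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the drift scenario; fakeNow stands in for
// System.currentTimeMillis so the burst is deterministic.
class DriftDemo {
  static final AtomicLong lastTick = new AtomicLong();
  static long fakeNow;

  static long currentTimeAdvancing() {
    return lastTick.updateAndGet(x -> fakeNow <= x ? x + 1 : fakeNow);
  }

  public static void main(String[] args) {
    fakeNow = 1000;
    long last = 0;
    for (int i = 0; i < 10; i++) {
      last = currentTimeAdvancing(); // burst of 10 calls within one tick
    }
    System.out.println(last - fakeNow);          // drift: 9 ms ahead
    fakeNow = 1009; // system clock catches up to, but does not pass, us
    System.out.println(currentTimeAdvancing());  // still incrementing: 1010
    fakeNow = 1011; // system clock finally passes us
    System.out.println(currentTimeAdvancing());  // re-synced: 1011
  }
}
```

So the drift is bounded by the burst size and self-heals as soon as real time overtakes the local tick.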

The spin wait alternative looks like this:
{code:java}
+    @Override
+    public long currentTimeAdvancing() throws InterruptedException {
+      long now;
+      do {
+        now = System.currentTimeMillis();
+      } while (now <= lastTick.get());
+      lastTick.set(now);
+      return now;
+    }
{code}

This doesn't have the potential drawback mentioned above. We just wait for the 
system clock to tick over. If microbenchmarks indicate spinning is cheap 
enough, we can opt for this variant and avoid the drift question entirely.

We can also yield while spinning:
{code:java}
+    @Override
+    public long currentTimeAdvancing() throws InterruptedException {
+      long now;
+      do {
+        now = System.currentTimeMillis();
+        Thread.sleep(1, 0); // black magic to ensure a yield on all platforms
+      } while (now <= lastTick.get());
+      lastTick.set(now);
+      return now;
+    }
{code}

Will microbenchmark among the alternatives. 
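A crude harness along these lines (purely illustrative; the real comparison should use JMH, and all names here are hypothetical) can give a first read on the per-call overheads:

```java
import java.util.concurrent.atomic.AtomicLong;

// Crude, illustrative comparison harness; a real measurement would use JMH
// to control for JIT warmup and dead-code elimination.
class ClockBench {
  static final AtomicLong lastTick = new AtomicLong();

  static long advancing() {
    return lastTick.updateAndGet(x -> {
      long now = System.currentTimeMillis();
      return now <= x ? x + 1 : now;
    });
  }

  static long time(Runnable body) {
    long start = System.nanoTime();
    body.run();
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    final int iters = 1_000_000;
    // Warm up both paths before timing.
    for (int i = 0; i < iters; i++) {
      advancing();
      System.currentTimeMillis();
    }
    long plain = time(() -> { for (int i = 0; i < iters; i++) System.currentTimeMillis(); });
    long atomic = time(() -> { for (int i = 0; i < iters; i++) advancing(); });
    System.out.println("plain ns/op:  " + plain / iters);
    System.out.println("atomic ns/op: " + atomic / iters);
  }
}
```

The interesting number is the delta between the two, which is the cost of the CAS retry loop inside {{updateAndGet}}.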




> Introduce EnvironmentEdge.currentTimeAdvancing
> ----------------------------------------------
>
>                 Key: HBASE-25913
>                 URL: https://issues.apache.org/jira/browse/HBASE-25913
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0
>
>         Attachments: 
> 0001-HBASE-25913-Introduce-EnvironmentEdge.currentTimeAdv.patch
>
>
> Introduce new {{EnvironmentEdge#currentTimeAdvancing}} which ensures that 
> when the current time is returned, it is the current time in a different 
> clock tick from the last time the {{EnvironmentEdge}} was used to get the 
> current time.
> When processing mutations we substitute the {{Long.MAX_VALUE}} timestamp 
> placeholder with a real timestamp just before committing the mutation. The 
> current code gets the current time for timestamp substitution while under row 
> lock and mvcc. We will simply use {{EnvironmentEdge#currentTimeAdvancing}} 
> instead of {{EnvironmentEdge#currentTime}} at this point in the code to 
> ensure we have seen the clock tick over. When processing a batch of mutations 
> (doMiniBatchMutation etc) we will call {{currentTimeAdvancing}} only once. 
> This means the client cannot bundle cells with wildcard timestamps into a 
> batch where those cells must be committed with different timestamps. Clients 
> must simply not submit mutations that must be committed with guaranteed 
> distinct timestamps in the same batch. Easy to understand, easy to document, 
> and it aligns with our design philosophy of the client knows best.
> It is not required to handle batches as proposed. We could guarantee a 
> distinct timestamp for every mutation in a batch. Count the number of 
> mutations, call this M. Acquire all row locks and get the current time. Then, 
> wait for at least M milliseconds. Then, set the first mutation timestamp with 
> this value and increment by 1 for all remaining. Then, do the rest of 
> mutation processing as normal. I don't think this extra waiting to reserve 
> the range of timestamps is necessary. See reasoning in above paragraph. 
> Mentioned here for sake of discussion.
> It will be fine to continue to use {{EnvironmentEdge#currentTime}} everywhere 
> else. In this way we will only potentially spin wait where it matters, and 
> won't suffer serious overheads during batch processing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
