[ 
https://issues.apache.org/jira/browse/HBASE-27487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634969#comment-17634969
 ] 

Briana Augenreich edited comment on HBASE-27487 at 11/16/22 6:40 PM:
---------------------------------------------------------------------

To patch {{AsyncProcess}}, I can create an 
{{OperationTimeoutExceededException}} which extends {{HBaseIOException}} and 
add this to the list of special exceptions in {{ClientExceptionsUtil}}. 
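
As a rough illustration of the idea above, here is a minimal, self-contained sketch. It does not use the real HBase classes: {{HBaseIOException}} and {{DoNotRetryIOException}} are local stand-ins, and {{isSpecialException}} and {{shouldClearLocationCache}} are hypothetical simplifications of the check in {{ClientExceptionsUtil}}, not its actual signatures.

```java
import java.io.IOException;

public class SpecialExceptionSketch {
    /** Local stand-in for HBase's HBaseIOException; not the real class. */
    static class HBaseIOException extends IOException {
        HBaseIOException(String message) { super(message); }
    }

    /** Local stand-in for the exception that is thrown today on timeout. */
    static class DoNotRetryIOException extends HBaseIOException {
        DoNotRetryIOException(String message) { super(message); }
    }

    /** The proposed exception type for an exceeded operation timeout. */
    static class OperationTimeoutExceededException extends HBaseIOException {
        OperationTimeoutExceededException(String message) { super(message); }
    }

    /**
     * Hypothetical, simplified "special exception" check. In this sketch the
     * proposed timeout exception is the only special exception.
     */
    static boolean isSpecialException(Throwable t) {
        return t instanceof OperationTimeoutExceededException;
    }

    /** Assumption for this sketch: only non-special exceptions clear cached locations. */
    static boolean shouldClearLocationCache(Throwable t) {
        return !isSpecialException(t);
    }

    public static void main(String[] args) {
        // Current behaviour: the timeout surfaces as a cache clearing exception.
        System.out.println(shouldClearLocationCache(
            new DoNotRetryIOException("operation timeout exceeded")));
        // Proposed behaviour: the timeout keeps the already resolved locations.
        System.out.println(shouldClearLocationCache(
            new OperationTimeoutExceededException("operation timeout exceeded")));
    }
}
```

The point of the sketch is only that a timeout is classified separately from a wrong-location failure, so locations that were slow to resolve are not thrown away.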

 


was (Author: JIRAUSER283873):
In order to patch the AsyncProcess I can create an 
{{OperationTimeoutExceededException}} which extends {{HBaseIOException}} and 
add this to the list of special exceptions in {{ClientExceptionsUtil}}. 

 

> Slow meta can create pathological feedback loop with multigets
> --------------------------------------------------------------
>
>                 Key: HBASE-27487
>                 URL: https://issues.apache.org/jira/browse/HBASE-27487
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.5.1, 2.4.15
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> This only affects the Table implementation in 2.x releases.
> h4. Call stack
> When Table.batch is called, an AsyncProcessTask is created with 
> SubmittedRows.ALL, which is sent to AsyncProcess.submit(). For the ALL case, 
> this goes to submitAll which creates an AsyncRequestFutureImpl and then calls 
> groupAndSendMultiAction on that.
> When an AsyncRequestFutureImpl is created, a RetryingTimeTracker is created 
> and started as the last step of the constructor.
> In groupAndSendMultiAction, the first thing that has to be done is resolve 
> the HRegionLocation for every action in the batch. This is currently done 
> sequentially, with no timeout on the overall batch completion.
> Once all actions have been resolved, they are passed into sendMultiAction 
> which creates a SingleServerRequestRunnable. Once that runnable is executed, 
> the first thing it does is create a new MultiServerCallable using the same 
> RetryingTimeTracker that was originally created way back.
> That callable extends CancellableRegionServerCallable, and the call method 
> first checks the tracker.getRemainingTime() before actually doing any work. 
> If exceeded, it throws an exception.
> h4. Problem
> If meta is overloaded, or you send any sufficiently large batch of actions, 
> the resolving of HRegionLocations (which happens sequentially) may take a 
> while.
> Depending on the operation timeout configured for the client, that duration 
> may already exceed that timeout before even reaching the 
> CancellableRegionServerCallable.call().
> When the timeout is exceeded there, a DoNotRetryIOException is thrown. This 
> is considered a cache clearing exception, so any locations that may have been 
> slowly resolved earlier up the chain will be thrown away.
> If done with enough concurrency, this can create a feedback loop that is 
> impossible to recover from.
> h4. Potential Solutions
>  # Change the thrown exception type from DoNotRetryIOException to something 
> more appropriate for the actual error (some sort of timeout exception). We'd 
> have to make that exception a "special" exception in ClientExceptionsUtil so 
> that it doesn't clear the cache.
>  # Make DoNotRetryIOException itself a "special" exception. The point of 
> clearing cache is to make retries more likely to succeed if the failure was 
> related to a wrong location. But DoNotRetryIOException explicitly is not 
> supposed to be retried, so you might think it shouldn't clear the cache as 
> well. There are many usages of this exception, so it's hard to say for sure 
> that this would be universally safe.
>  # Reset the RetryingTimeTracker after resolving region locations.
> I think I'd lean towards option 1, because it seems odd to say "don't retry 
> in that case". In fact, retrying should be more likely to succeed because 
> locations will have been resolved.
> Whichever we choose, I think we should additionally check the timeout in 
> groupAndSendMultiAction after resolving each region location. That process 
> should not be allowed to exceed the timeout, but currently it can overrun 
> it by a wide margin before the timeout is finally checked, incidentally, at 
> the end.
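
The per-location timeout check suggested at the end of the description could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual {{groupAndSendMultiAction}} code: {{resolveAll}}, the {{locator}} function, and the injected clock are hypothetical names, and the exception is a local stand-in that extends {{IOException}} directly.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.LongSupplier;

public class ResolveWithDeadlineSketch {
    /** Local stand-in for the proposed timeout exception. */
    static class OperationTimeoutExceededException extends IOException {
        OperationTimeoutExceededException(String message) { super(message); }
    }

    /**
     * Resolves a location for every action sequentially and checks the
     * elapsed time after each lookup, so a slow lookup path fails fast
     * instead of running far past the operation timeout.
     */
    static <A, L> List<L> resolveAll(List<A> actions, Function<A, L> locator,
            long operationTimeoutMs, LongSupplier clockMs)
            throws OperationTimeoutExceededException {
        long start = clockMs.getAsLong();
        List<L> locations = new ArrayList<>(actions.size());
        for (A action : actions) {
            locations.add(locator.apply(action));
            long elapsed = clockMs.getAsLong() - start;
            if (elapsed >= operationTimeoutMs) {
                throw new OperationTimeoutExceededException(
                    "Resolved " + locations.size() + " of " + actions.size()
                        + " locations before exceeding " + operationTimeoutMs + " ms");
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        // Fake clock that advances 10 ms per lookup, standing in for a slow meta.
        long[] now = {0};
        Function<String, String> slowLocator = row -> {
            now[0] += 10;
            return "server-for-" + row;
        };
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e");
        try {
            resolveAll(rows, slowLocator, 25, () -> now[0]);
        } catch (OperationTimeoutExceededException e) {
            System.out.println("Timed out: " + e.getMessage());
        }
    }
}
```

Injecting the clock is only a convenience for testing the sketch; a real change would presumably reuse the existing RetryingTimeTracker rather than a separate clock.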



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
