[jira] [Commented] (HBASE-26552) Introduce retry to logroller when encounters IOException

2022-01-17 Thread Xiaolin Ha (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477624#comment-17477624 ]

Xiaolin Ha commented on HBASE-26552:


Thanks for your advice, [~anoop.hbase]. I think it is possible to add retries when 
creating the new writer instance in AbstractFSWAL:
{code:java}
Path newPath = getNewPath();
// Any exception from here on is catastrophic, non-recoverable so we currently abort.
W nextWriter = this.createWriterInstance(newPath); {code}
When creating the new writer fails, we can just close it and retry a configurable 
number of times.

Closing the failed new writer and the old WAL should use the same logic as the 
existing writer close, which just swallows the IOExceptions and logs warnings.
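To make the idea concrete, a minimal sketch of the bounded retry around writer creation could look like the following (the config name, default value, and retry bookkeeping here are only assumptions for illustration, not the final patch):
{code:java}
// Illustrative sketch only: bounded retry around writer creation in AbstractFSWAL's
// roll path. "hbase.regionserver.logroll.retries" is an assumed config name.
int maxRetries = conf.getInt("hbase.regionserver.logroll.retries", 3);
IOException lastFailure = null;
for (int attempt = 0; attempt <= maxRetries; attempt++) {
  Path newPath = getNewPath();
  try {
    W nextWriter = this.createWriterInstance(newPath);
    // Success: continue the roll with nextWriter as before.
    return nextWriter;
  } catch (IOException ioe) {
    lastFailure = ioe;
    LOG.warn("Failed to create WAL writer for {}, attempt={}", newPath, attempt, ioe);
    // Close any partially created writer with the same swallow-and-warn logic used
    // for the old writer, then retry.
  }
}
// Retries exhausted: rethrow so the existing abort path in AbstractWALRoller still applies.
throw lastFailure;
{code}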

> Introduce retry to logroller when encounters IOException
> 
>
> Key: HBASE-26552
> URL: https://issues.apache.org/jira/browse/HBASE-26552
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Affects Versions: 3.0.0-alpha-1, 2.0.0
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Major
>
> When calling RollController#rollWal in AbstractWALRoller, the regionserver 
> may abort when it encounters an exception:
> {code:java}
> ...
> } catch (FailedLogCloseException | ConnectException e) {
>   abort("Failed log close in log roller", e);
> } catch (IOException ex) {
>   // Abort if we get here. We probably won't recover an IOE. HBASE-1132
>   abort("IOE in log roller",
> ex instanceof RemoteException ? ((RemoteException) ex).unwrapRemoteException() : ex);
> } catch (Exception ex) {
>   LOG.error("Log rolling failed", ex);
>   abort("Log rolling failed", ex);
> } {code}
> I think we should support retrying rollWal here, to avoid recovering the 
> service by killing the regionserver. Restarting a regionserver is costly and 
> hurts availability.
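> As a rough sketch, the bounded retry could be a small wrapper around the roll (the helper name, retry bound, and simplified signature below are only assumptions for illustration):
> {code:java}
> // Illustrative only: retry rollWal a bounded number of times before giving up,
> // so the existing catch blocks (and the abort) only fire after the last attempt.
> private void rollWalWithRetries(RollController controller, long now, int maxRetries)
>     throws IOException {
>   IOException last = null;
>   for (int attempt = 0; attempt <= maxRetries; attempt++) {
>     try {
>       controller.rollWal(now); // signature simplified for the sketch
>       return; // roll succeeded
>     } catch (IOException ex) {
>       last = ex;
>       LOG.warn("WAL roll failed, attempt={} of {}", attempt, maxRetries, ex);
>     }
>   }
>   throw last; // retries exhausted, fall back to the current abort handling
> }
> {code}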
> I find that when creating a new writer for the WAL in 
> FanOutOneBlockAsyncDFSOutputHelper#createOutput, it already supports retrying 
> addBlock via the config "hbase.fs.async.create.retries". The idea of retrying 
> the WAL roll is similar: both try their best to make the WAL roll succeed. 
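> For reference, that existing retry bound is just an ordinary configuration value (the value below is only an example, not a statement about the default):
> {code:java}
> // Example only: bumping the addBlock retry count used when creating the async WAL output.
> // Uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.HBaseConfiguration.
> Configuration conf = HBaseConfiguration.create();
> conf.setInt("hbase.fs.async.create.retries", 10);
> {code}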
> But the initialization of the new WAL writer also includes flushing the write 
> buffer and waiting until the flush completes, in 
> AsyncProtobufLogWriter#writeMagicAndWALHeader, which can also fail for 
> hardware reasons. The regionserver has connected to the datanodes after addBlock, 
> but that does not mean the magic and header can be flushed successfully.
> {code:java}
> protected long writeMagicAndWALHeader(byte[] magic, WALHeader header) throws IOException {
>   return write(future -> {
>     output.write(magic);
>     try {
>       header.writeDelimitedTo(asyncOutputWrapper);
>     } catch (IOException e) {
>       // should not happen
>       throw new AssertionError(e);
>     }
>     addListener(output.flush(false), (len, error) -> {
>       if (error != null) {
>         future.completeExceptionally(error);
>       } else {
>         future.complete(len);
>       }
>     });
>   });
> }{code}
> We have found that in our production clusters, regionservers sometimes abort 
> because of "IOE in log roller". In practice, just one more retry of rollWal is 
> enough to make the WAL roll complete and keep the regionserver serving.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26552) Introduce retry to logroller when encounters IOException

2022-01-17 Thread Anoop Sam John (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477558#comment-17477558 ]

Anoop Sam John commented on HBASE-26552:


Can you give a detailed proposal please? Like, for the different cases in a roll, 
how do we react? What if the old log close fails? There seems to be some logic 
within it, and configs, to decide whether that close-time issue can be ignored or 
not. So a case-by-case proposal would help. Good one. Fully agree that if we can 
avoid the abort, that can help a lot.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)