[ https://issues.apache.org/jira/browse/HBASE-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087093#comment-13087093 ]
Andrew Purtell commented on HBASE-4222:
---------------------------------------

+1

We've tested this on EC2 clusters and it works.

> Make HLog more resilient to write pipeline failures
> ---------------------------------------------------
>
>                 Key: HBASE-4222
>                 URL: https://issues.apache.org/jira/browse/HBASE-4222
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>             Fix For: 0.92.0
>
>
> The current implementation of HLog rolling to recover from transient errors
> in the write pipeline seems to have two problems:
> # When {{HLog.LogSyncer}} triggers an {{IOException}} during time-based sync
> operations, it triggers a log rolling request in the corresponding catch
> block, but only after escaping from the internal while loop. As a result,
> the {{LogSyncer}} thread will exit and never be restarted from what I can
> tell, even if the log rolling was successful.
> # Log rolling requests triggered by an {{IOException}} in {{sync()}} or
> {{append()}} never happen if no entries have yet been written to the log.
> This means that write errors are not immediately recovered, which extends the
> exposure to more errors occurring in the pipeline.
> In addition, it seems like we should be able to better handle transient
> problems, like a rolling restart of DataNodes while the HBase RegionServers
> are running. Currently this will reliably cause RegionServer aborts during
> log rolling: either an append or time-based sync triggers an initial
> {{IOException}}, initiating a log rolling request. However the log rolling
> then fails in closing the current writer ("All datanodes are bad"), causing a
> RegionServer abort. In this case, it seems like we should at least allow you
> an option to continue with the new writer and only abort on subsequent errors.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
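
To illustrate the first problem in the quoted description (the sync loop exiting on its first IOException), here is a minimal, self-contained sketch. It is not the HBase code or the patch attached to this issue: the class SyncerLoopSketch, the Syncable interface, and the helper methods are hypothetical names made up for the example, and a bounded for loop stands in for the syncer thread's while loop. It only contrasts a catch block placed outside the loop with one placed inside it.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class SyncerLoopSketch {
    /** Hypothetical stand-in for the log writer's sync call. */
    public interface Syncable {
        void sync() throws IOException;
    }

    /**
     * Shape described in the issue: the try block wraps the whole loop,
     * so the first IOException leaves the loop for good.
     * Returns the number of syncs that completed.
     */
    public static int runCatchOutsideLoop(Syncable log, int ticks, AtomicInteger rollRequests) {
        int completed = 0;
        try {
            for (int i = 0; i < ticks; i++) {
                log.sync();
                completed++;
            }
        } catch (IOException e) {
            // A roll is requested, but the loop has already been exited.
            rollRequests.incrementAndGet();
        }
        return completed;
    }

    /**
     * One possible alternative (an assumption, not the actual fix):
     * catch inside the loop, request a roll, and keep iterating.
     */
    public static int runCatchInsideLoop(Syncable log, int ticks, AtomicInteger rollRequests) {
        int completed = 0;
        for (int i = 0; i < ticks; i++) {
            try {
                log.sync();
                completed++;
            } catch (IOException e) {
                // Request a roll and continue with the next tick.
                rollRequests.incrementAndGet();
            }
        }
        return completed;
    }

    /** Builds a sync target that fails only on its n-th call (simulated transient error). */
    public static Syncable failingOnCall(int n) {
        AtomicInteger calls = new AtomicInteger();
        return () -> {
            if (calls.incrementAndGet() == n) {
                throw new IOException("simulated write pipeline failure");
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger rollsOutside = new AtomicInteger();
        AtomicInteger rollsInside = new AtomicInteger();
        int outside = runCatchOutsideLoop(failingOnCall(2), 5, rollsOutside);
        int inside = runCatchInsideLoop(failingOnCall(2), 5, rollsInside);
        System.out.println("catch outside loop: syncs=" + outside + " rollRequests=" + rollsOutside.get());
        System.out.println("catch inside loop:  syncs=" + inside + " rollRequests=" + rollsInside.get());
    }
}
```

With a single simulated failure on the second of five ticks, the first variant stops after one completed sync, while the second requests one roll and still completes the remaining four syncs. A real syncer would also need to decide when repeated failures should abort, which this sketch leaves out.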