[ 
https://issues.apache.org/jira/browse/HBASE-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087093#comment-13087093
 ] 

Andrew Purtell commented on HBASE-4222:
---------------------------------------

+1 We've tested this on EC2 clusters and it works.

> Make HLog more resilient to write pipeline failures
> ---------------------------------------------------
>
>                 Key: HBASE-4222
>                 URL: https://issues.apache.org/jira/browse/HBASE-4222
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>             Fix For: 0.92.0
>
>
> The current implementation of HLog rolling to recover from transient errors 
> in the write pipeline seems to have two problems:
> # When {{HLog.LogSyncer}} hits an {{IOException}} during a time-based sync, 
> the corresponding catch block requests a log roll, but that catch block is 
> only reached after escaping from the internal while loop.  As a result, the 
> {{LogSyncer}} thread exits and, from what I can tell, is never restarted, 
> even if the log roll was successful.
> # Log rolling requests triggered by an {{IOException}} in {{sync()}} or 
> {{append()}} never happen if no entries have yet been written to the log.  
> This means that write errors are not immediately recovered, which extends the 
> exposure to more errors occurring in the pipeline.
> In addition, it seems like we should be able to better handle transient 
> problems, like a rolling restart of DataNodes while the HBase RegionServers 
> are running.  Currently this will reliably cause RegionServer aborts during 
> log rolling: either an append or time-based sync triggers an initial 
> {{IOException}}, initiating a log rolling request.  However, the log rolling 
> then fails while closing the current writer ("All datanodes are bad"), causing 
> a RegionServer abort.  In this case, it seems like we should at least provide 
> an option to continue with the new writer and only abort on subsequent errors.
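
The first problem in the description above (the syncer thread exiting after a single {{IOException}}) can be illustrated with a minimal sketch. This is not the HBase code: {{Syncable}}, {{LogSyncerSketch}}, {{runFragile}}, {{runResilient}}, and the {{rollRequester}} hook are hypothetical names chosen for illustration, and a bounded tick count stands in for the real timed loop.

```java
import java.io.IOException;

/** Hypothetical stand-in for the WAL writer; not the HBase API. */
interface Syncable {
    void sync() throws IOException;
}

/** Illustrative only: contrasts two placements of the catch block. */
class LogSyncerSketch {

    /**
     * Fragile variant: the try/catch wraps the whole loop, so the first
     * IOException leaves the loop. The roll is requested, but no further
     * syncs happen, which mirrors the thread exit described in the issue.
     */
    static int runFragile(Syncable log, Runnable rollRequester, int ticks) {
        int completed = 0;
        try {
            for (int i = 0; i < ticks; i++) {
                log.sync();
                completed++;
            }
        } catch (IOException e) {
            rollRequester.run(); // ask for a log roll, then fall out of the loop
        }
        return completed;
    }

    /**
     * Resilient variant: the try/catch sits inside the loop, so a failed
     * sync requests a log roll and the loop keeps running.
     */
    static int runResilient(Syncable log, Runnable rollRequester, int ticks) {
        int completed = 0;
        for (int i = 0; i < ticks; i++) {
            try {
                log.sync();
                completed++;
            } catch (IOException e) {
                rollRequester.run(); // ask for a log roll and continue syncing
            }
        }
        return completed;
    }
}
```

With a writer that fails only on its second sync, five ticks of the fragile variant complete one sync, while the resilient variant completes four; both request exactly one log roll.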

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
