[jira] [Commented] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL

stack (JIRA) Wed, 02 Sep 2015 10:21:06 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727661#comment-14727661
 ]


stack commented on HBASE-14317:
-------------------------------

The testFlushMarkersWALFail failure was interesting. The Mockito matcher was 
failing for me: fields were null. I undid the Mockito matcher for this test.

The second issue has to do with sloppy semantics. Previously, an append could 
throw an exception and a later sync could still go in and succeed. You could 
then carry on using the WAL as though no exception had been thrown.

This patch hardens our semantics such that once a WAL throws an exception, no 
new appends or syncs will succeed until you replace the WAL. For 
testFlushMarkersWALFail, because there is no log rolling thread running, we'd 
just hang, making no progress, because the WAL had gone bad. I added forced log 
rolling after each test step. Also fixed weird stuff like this:

{code}
         } catch (IOException ioe) {
-          LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
-              + StringUtils.stringifyException(ioe));
+          wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+          throw ioe;
         }
{code}

Note how we used to ignore a failed sync, only logging it at WARN.

One implication of the hardened semantics is that the dodgy trick of getting 
a sequenceid by adding an 'empty append' now fails if the WAL is bad. A log 
roll will fix it. I've been seeing some of this in tests, and the fix is to add 
a log roll (in a server, we have the log rolling thread running; tests that 
exercise regions only do not).
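
To make the hardened semantics concrete, here is a minimal toy sketch. It is 
not the FSHLog code and not what the patch does internally; StickyFailureWal, 
its Writer interface, and the demo() helper are all made-up names for 
illustration. The idea: the first IOException is remembered, every later 
append or sync fails fast, and only a roll (replacing the writer) clears the 
failure.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are HBase classes.
class StickyFailureWal {
  /** Hypothetical stand-in for the underlying file writer. */
  interface Writer {
    void write(String entry) throws IOException;
    void sync() throws IOException;
  }

  private Writer writer;
  private IOException failure; // first unrecovered failure; null while healthy

  StickyFailureWal(Writer writer) {
    this.writer = writer;
  }

  void append(String entry) throws IOException {
    checkHealthy();
    try {
      writer.write(entry);
    } catch (IOException e) {
      failure = e; // remember it, so later calls cannot quietly succeed
      throw e;
    }
  }

  void sync() throws IOException {
    checkHealthy();
    try {
      writer.sync();
    } catch (IOException e) {
      failure = e;
      throw e;
    }
  }

  /** Replacing the writer (a log roll) is the only way to clear the failure. */
  void roll(Writer replacement) {
    writer = replacement;
    failure = null;
  }

  private void checkHealthy() throws IOException {
    if (failure != null) {
      throw new IOException("WAL is broken, roll it before reuse", failure);
    }
  }

  /** Writer whose write() fails while 'bad' is true; sync() always succeeds. */
  static Writer writer(boolean bad) {
    return new Writer() {
      @Override public void write(String entry) throws IOException {
        if (bad) {
          throw new IOException("bad disk");
        }
      }
      @Override public void sync() { }
    };
  }

  /** Runs append, sync, roll, append+sync and records which steps succeeded. */
  static List<String> demo() {
    List<String> outcome = new ArrayList<>();
    StickyFailureWal wal = new StickyFailureWal(writer(true));
    try {
      wal.append("edit-1");
      outcome.add("append ok");
    } catch (IOException e) {
      outcome.add("append failed");
    }
    try {
      wal.sync(); // the writer's sync() would succeed, but the WAL is already bad
      outcome.add("sync ok");
    } catch (IOException e) {
      outcome.add("sync failed");
    }
    wal.roll(writer(false));
    try {
      wal.append("edit-2");
      wal.sync();
      outcome.add("after roll ok");
    } catch (IOException e) {
      outcome.add("after roll failed");
    }
    return outcome;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

In this toy, the sync after the failed append is rejected even though the 
underlying writer would have accepted it, which is the old sloppy behavior 
being closed off. It also shows why a test with no log roller needs an 
explicit roll: nothing else ever clears the failure.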

> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>
>         Attachments: 14317.test.txt, 14317v10.txt, 14317v5.branch-1.2.txt, 
> 14317v5.txt, 14317v9.txt, HBASE-14317-v1.patch, HBASE-14317-v2.patch, 
> HBASE-14317-v3.patch, HBASE-14317-v4.patch, HBASE-14317.patch, [Java] RS 
> stuck on WAL sync to a dead DN - Pastebin.com.html, append-only-test.patch, 
> raw.php, repro.txt, san_dump.txt, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because can't append (See HDFS-8960) but we get stuck. 
> See attached thread dump and associated log. What is interesting is that 
> syncers are waiting to take syncs to run and at same time we want to flush so 
> we are waiting on a safe point but there seems to be nothing in our ring 
> buffer; did we go to roll log and not add safe point sync to clear out 
> ringbuffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
