[ 
https://issues.apache.org/jira/browse/ACCUMULO-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828132#comment-13828132
 ] 

Luke Brassard commented on ACCUMULO-1830:
-----------------------------------------

We saw this behavior in a 1.5.x build and want to share how we think it 
occurred, in case others run into similar issues.

Here's a breakdown of the events that took place:

{noformat:nopanel=true}
Timeline:

20:15 i.rf created                              // A
20:20 i.rf compacted away to j.rf     
20:25 i.rf deleted and references updated       // B
??:?? Minor compaction cleans up old walogs     // C
??:?? CLUSTER CRASH
22:20 recovery
22:22 missing file (i.rf) reported

In pictures:    
    
            WAL-1                       RootTablet
            +---------------+           +---------------+
        ?   | file:i.rf     |<---[A]--->| file:i.rf     |
       /    |               |        +->| del:file:i.rf |
      /     +---------------+       /   +---------------+
   [C]                             /
      \     WAL-2                [B]
       \    +---------------+    /
        \+-X| del:file:i.rf | <-+
            |               |
            +---------------+

A: 'i' reference written to RootTablet and WAL-1
B: 'i' delete marker written to RootTablet and WAL-2
    - at this point, there is a delete marker in the RootTablet and in a WAL
C: After the minor compaction, Accumulo starts cleaning up WAL-1 and WAL-2, 
   but the failure interrupts the cleanup: WAL-2 is removed and WAL-1 is 
   left behind
    
{noformat}

On restart, the 'file:i.rf' record is recovered from WAL-1 and re-added to 
the RootTablet. The matching delete marker lived only in WAL-2, which is 
gone, so the RootTablet now tells Accumulo that i.rf exists even though the 
file was deleted before the crash happened.
    
A lack of atomicity in the cleanup of walogs seems to be the cause of this 
behavior.
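
To make the failure mode concrete, here is a minimal sketch of the replay 
logic as we understand it. This is not Accumulo code: the replay() function, 
the (op, file) entry format, and the assumption that the flushed RootTablet 
holds only j.rf are all hypothetical and exist only to illustrate the 
timeline above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical log entry: ("add", file) records a file reference,
# ("del", file) records its delete marker.
Entry = Tuple[str, str]


def replay(persisted: Set[str], wals: Dict[str, List[Entry]]) -> Set[str]:
    """Apply the surviving WALs, in name order, to a copy of the file set."""
    files = set(persisted)
    for name in sorted(wals):
        for op, f in wals[name]:
            if op == "add":
                files.add(f)
            elif op == "del":
                files.discard(f)
    return files


# Assumed state after the minor compaction: i.rf was compacted into j.rf.
persisted = {"j.rf"}
wal1 = [("add", "i.rf")]  # event A
wal2 = [("del", "i.rf")]  # event B

# Cleanup finished, or never started: the delete cancels the add.
print(sorted(replay(persisted, {})))                              # ['j.rf']
print(sorted(replay(persisted, {"WAL-1": wal1, "WAL-2": wal2})))  # ['j.rf']

# Cleanup interrupted after WAL-2 was removed: i.rf is resurrected.
print(sorted(replay(persisted, {"WAL-1": wal1})))         # ['i.rf', 'j.rf']
```

Only the last case leaves a reference to a file that no longer exists on 
disk, which matches the "missing file (i.rf)" report at 22:22.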

> illegal state in RestartStressIT
> --------------------------------
>
>                 Key: ACCUMULO-1830
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1830
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>    Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.4.4, 1.5.0
>         Environment: on master, 135e67b68592f0d1c7ca69bac318a7ad3ed55831
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>            Priority: Critical
>             Fix For: 1.4.5, 1.5.1
>
>
> {noformat}
> 2013-10-29 15:20:11,125 [state.MetaDataTableScanner] ERROR: 
> java.lang.RuntimeException: 
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>  found two locations for the same extent 1<: host:50867[14205a7c2a90003] and 
> host:41255[14205a7c2a9000a]
> java.lang.RuntimeException: 
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>  found two locations for the same extent 1<: host[14205a7c2a90003] and 
> host:41255[14205a7c2a9000a]
>         at 
> org.apache.accumulo.server.master.state.MetaDataTableScanner.fetch(MetaDataTableScanner.java:189)
>         at 
> org.apache.accumulo.server.master.state.MetaDataTableScanner.next(MetaDataTableScanner.java:124)
>         at 
> org.apache.accumulo.server.master.state.MetaDataTableScanner.next(MetaDataTableScanner.java:1)
>         at 
> org.apache.accumulo.server.master.TabletGroupWatcher.run(TabletGroupWatcher.java:143)
> Caused by: 
> org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException:
>  found two locations for the same extent 1<: host:50867[14205a7c2a90003] and 
> host:41255[14205a7c2a9000a]
>         at 
> org.apache.accumulo.server.master.state.MetaDataTableScanner.createTabletLocationState(MetaDataTableScanner.java:157)
>         at 
> org.apache.accumulo.server.master.state.MetaDataTableScanner.fetch(MetaDataTableScanner.java:185)
>         ... 3 more
> {noformat}
> Here's where the test stopped 
> {noformat}
> java.lang.IllegalStateException: Tablet has multiple locations : 1<
>       at 
> org.apache.accumulo.core.metadata.MetadataLocationObtainer.getMetadataLocationEntries(MetadataLocationObtainer.java:233)
>       at 
> org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:118)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:462)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:619)
>       at 
> org.apache.accumulo.core.client.impl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:437)
>       at 
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:226)
>       at 
> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:84)
>       at 
> org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:177)
>       at 
> org.apache.accumulo.test.VerifyIngest.verifyIngest(VerifyIngest.java:162)
>       at 
> org.apache.accumulo.test.functional.RestartStressIT.test(RestartStressIT.java:73)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)
