[ 
https://issues.apache.org/jira/browse/ACCUMULO-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308561#comment-15308561
 ] 

Michael Wall commented on ACCUMULO-4157:
----------------------------------------

This was found on a live cluster.  It has been there for a long time.

> WAL can be prematurely deleted
> ------------------------------
>
>                 Key: ACCUMULO-4157
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4157
>             Project: Accumulo
>          Issue Type: Bug
>          Components: gc
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Michael Wall
>            Assignee: Michael Wall
>            Priority: Blocker
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> Ran into a situation where the master started logging [Unable to initiate log 
> sort|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103]
>  because the WAL could not be found right after a tserver died.  The WAL was 
> for a tablet in the metadata table, with a key extent like 
> *!0;t;endrow;prevrow* hosted by a tabletserver like *tserver1*.  Doing a log 
> sort happens in the master before a tablet can be reassigned and brought back 
> online.  Accumulo is in a really bad state when this happens as that tablet 
> will stay unhosted until manual intervention.  Luckily, the WAL was in the 
> HDFS trash and could be copied back to the expected location.
> Piecing together the logs showed something like the following
> * 13:30:36 *tserver1* minor compacted for that extent [MinC finish lock 0.00 
> secs 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L938],
>  maybe removing the log entry from the metadata table
> * 13:37:01 *tserver1* data written, [Adding 1 logs for extent 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131],
>  maybe adding back that log entry to the metadata table
> * 13:38:58 *master*   thinks the server went down, [Lost servers 
> tserver1:9997|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1199],
>  but it is really still up for another 8 mins
> * 13:38:58 *master*   identifies WAL to be recovered [Recovering 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L147]
> * 13:38:58 *master*   [Loaded class : 
> org.apache.accumulo.server.master.recovery.HadoopLogCloser|https://github.com/apache/accumulo/blob/rel/1.6.5/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L239]
> * 13:38:58 *master*   [Started recovery of 
> hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a 
> reference|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L181]
> * 13:39:13 *master*   [Waiting for file to be closed 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/base/src/main/java/org/apache/accumulo/server/master/recovery/HadoopLogCloser.java#L55]
> * 13:39:16 *gc*       cleaning up WAL doesn't see reference, thinks server is 
> down so wacks the WAL and logs [Removing WAL for offline server 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L194]
> * 13:39:18 *master*   gets FileNotFoundException for that WAL, [Unable to 
> initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID | 
> https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103].
>   Accumulo is in a bad state from here until manual intervention.
> * 13:44:16 *tserver1* more data written, [Adding 1 logs for extent 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131]
> * 13:45:45 *tserver1* finally dies after logging [Lost table server lock 
> (reason = SESSION_EXPIRED) 
> |https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3219]
> This suggests that the GargabeCollectWriteAheadLog is too aggressively 
> removing WALs for a server it thinks is dead but may actually still be doing 
> work.  The tserver was under heavy load before it went down.
> Studying the logs with [~kturner] and brainstorming, here are some things 
> that could be fixed/checked
> # When gc doesn't see a reference to a WAL in metadata table, it asks the 
> tablet server to delete the log.  The gc process then logs at DEBUG that the 
> WAL was deleted regardless of whether it was or not.  Maybe change log to 
> "asking tserver to delete WAL" or something.  We found these messages in the 
> gc log 45 minutes before this event.  These messages were misleading because 
> further investigation shows the tserver will log [Deleting wal | 
> https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2607]
>  when a WAL is truly deleted.  There were not such message in the tserver 45 
> min earlier, indicating the WAL was not actually deleted.
> # GC logs "Removing WAL for offline" at DEBUG.  These can roll off pretty 
> quickly, so change that to INFO.  This will help keep history around longer 
> to aid troubleshooting.
> # Verify the "adding 1 logs for extent" is using the srv:lock column to 
> enforce the constraint.  Looks like it is, but if zooLock is null in the 
> update of MetadataTableUtil maybe badness is happening.
> # In GC, maybe keep a map of first time we see a tablet server is down and 
> don't actually remove the WAL for offline tablet servers until they have been 
> down an hour or something.  Would need to make sure that map is cleared when 
> a tserver comes back online.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to