[ 
https://issues.apache.org/jira/browse/ACCUMULO-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Wall updated ACCUMULO-4157:
-----------------------------------
    Description: 
Ran into a situation where the master started logging [Unable to initiate log 
sort|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103]
 because the WAL could not be found right after a tserver died.  The WAL was 
for a tablet in the metadata table, with a key extent like 
*!0;t;endrow;prevrow* hosted by a tabletserver like *tserver1*.  Doing a log 
sort happens in the master before a tablet can be reassigned and brought back 
online.  Accumulo is in a really bad state when this happens as that tablet 
will stay unhosted until manual intervention.  Luckily, the WAL was in the HDFS 
trash and could be copied back to the expected location.

Piecing together the logs showed something like the following
* 13:30:36 *tserver1* minor compacted for that extent [MinC finish lock 0.00 
secs 
!0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L938],
 maybe removing the log entry from the metadata table
* 13:37:01 *tserver1* data written, [Adding 1 logs for extent 
!0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131],
 maybe adding back that log entry to the metadata table
* 13:38:58 *master*   thinks the server went down, [Lost servers 
tserver1:9997|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1199],
 but it is really still up for another 8 mins
* 13:38:58 *master*   identifies WAL to be recovered [Recovering 
hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L147]
* 13:38:58 *master*   [Loaded class : 
org.apache.accumulo.server.master.recovery.HadoopLogCloser|https://github.com/apache/accumulo/blob/rel/1.6.5/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L239]
* 13:38:58 *master*   [Started recovery of 
hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a 
reference|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L181]
* 13:39:13 *master*   [Waiting for file to be closed 
hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/base/src/main/java/org/apache/accumulo/server/master/recovery/HadoopLogCloser.java#L55]
* 13:39:16 *gc*       cleaning up WAL doesn't see reference, thinks server is 
down so wacks the WAL and logs [Removing WAL for offline server 
hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L194]
* 13:39:18 *master*   gets FileNotFoundException for that WAL, [Unable to 
initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID | 
https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103].
  Accumulo is in a bad state from here until manual intervention.
* 13:44:16 *tserver1* more data written, [Adding 1 logs for extent 
!0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131]
* 13:45:45 *tserver1* finally dies after logging [Lost table server lock 
(reason = SESSION_EXPIRED) 
|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3219]

This suggests that the GargabeCollectWriteAheadLog is too aggressively removing 
WALs for a server it thinks is dead but may actually still be doing work.  The 
tserver was under heavy load before it went down.

Studying the logs with [~kturner] and brainstorming, here are some things that 
could be fixed/checked
# When gc doesn't see a reference to a WAL in metadata table, it asks the 
tablet server to delete the log.  The gc process then logs at DEBUG that the 
WAL was deleted regardless of whether it was or not.  Maybe change log to 
"asking tserver to delete WAL" or something.  We found these messages in the gc 
log 45 minutes before this event.  These messages were misleading because 
further investigation shows the tserver will log [Deleting wal | 
https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2607]
 when a WAL is truly deleted.  There were not such message in the tserver 45 
min earlier, indicating the WAL was not actually deleted.
# GC logs "Removing WAL for offline" at DEBUG.  These can roll off pretty 
quickly, so change that to INFO.  This will help keep history around longer to 
aid troubleshooting.
# Verify the "adding 1 logs for extent" is using the srv:lock column to enforce 
the constraint.  Looks like it is, but if zooLock is null in the update of 
MetadataTableUtil maybe badness is happening.
# In GC, maybe keep a map of first time we see a tablet server is down and 
don't actually remove the WAL for offline tablet servers until they have been 
down an hour or something.  Would need to make sure that map is cleared when a 
tserver comes back online.

  was:More details to follow


> WAL can be prematurely deleted
> ------------------------------
>
>                 Key: ACCUMULO-4157
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4157
>             Project: Accumulo
>          Issue Type: Bug
>          Components: gc
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Michael Wall
>            Assignee: Michael Wall
>
> Ran into a situation where the master started logging [Unable to initiate log 
> sort|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103]
>  because the WAL could not be found right after a tserver died.  The WAL was 
> for a tablet in the metadata table, with a key extent like 
> *!0;t;endrow;prevrow* hosted by a tabletserver like *tserver1*.  Doing a log 
> sort happens in the master before a tablet can be reassigned and brought back 
> online.  Accumulo is in a really bad state when this happens as that tablet 
> will stay unhosted until manual intervention.  Luckily, the WAL was in the 
> HDFS trash and could be copied back to the expected location.
> Piecing together the logs showed something like the following
> * 13:30:36 *tserver1* minor compacted for that extent [MinC finish lock 0.00 
> secs 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L938],
>  maybe removing the log entry from the metadata table
> * 13:37:01 *tserver1* data written, [Adding 1 logs for extent 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131],
>  maybe adding back that log entry to the metadata table
> * 13:38:58 *master*   thinks the server went down, [Lost servers 
> tserver1:9997|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1199],
>  but it is really still up for another 8 mins
> * 13:38:58 *master*   identifies WAL to be recovered [Recovering 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L147]
> * 13:38:58 *master*   [Loaded class : 
> org.apache.accumulo.server.master.recovery.HadoopLogCloser|https://github.com/apache/accumulo/blob/rel/1.6.5/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L239]
> * 13:38:58 *master*   [Started recovery of 
> hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a 
> reference|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L181]
> * 13:39:13 *master*   [Waiting for file to be closed 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/base/src/main/java/org/apache/accumulo/server/master/recovery/HadoopLogCloser.java#L55]
> * 13:39:16 *gc*       cleaning up WAL doesn't see reference, thinks server is 
> down so wacks the WAL and logs [Removing WAL for offline server 
> hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L194]
> * 13:39:18 *master*   gets FileNotFoundException for that WAL, [Unable to 
> initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID | 
> https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103].
>   Accumulo is in a bad state from here until manual intervention.
> * 13:44:16 *tserver1* more data written, [Adding 1 logs for extent 
> !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131]
> * 13:45:45 *tserver1* finally dies after logging [Lost table server lock 
> (reason = SESSION_EXPIRED) 
> |https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3219]
> This suggests that the GargabeCollectWriteAheadLog is too aggressively 
> removing WALs for a server it thinks is dead but may actually still be doing 
> work.  The tserver was under heavy load before it went down.
> Studying the logs with [~kturner] and brainstorming, here are some things 
> that could be fixed/checked
> # When gc doesn't see a reference to a WAL in metadata table, it asks the 
> tablet server to delete the log.  The gc process then logs at DEBUG that the 
> WAL was deleted regardless of whether it was or not.  Maybe change log to 
> "asking tserver to delete WAL" or something.  We found these messages in the 
> gc log 45 minutes before this event.  These messages were misleading because 
> further investigation shows the tserver will log [Deleting wal | 
> https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2607]
>  when a WAL is truly deleted.  There were not such message in the tserver 45 
> min earlier, indicating the WAL was not actually deleted.
> # GC logs "Removing WAL for offline" at DEBUG.  These can roll off pretty 
> quickly, so change that to INFO.  This will help keep history around longer 
> to aid troubleshooting.
> # Verify the "adding 1 logs for extent" is using the srv:lock column to 
> enforce the constraint.  Looks like it is, but if zooLock is null in the 
> update of MetadataTableUtil maybe badness is happening.
> # In GC, maybe keep a map of first time we see a tablet server is down and 
> don't actually remove the WAL for offline tablet servers until they have been 
> down an hour or something.  Would need to make sure that map is cleared when 
> a tserver comes back online.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to