[ https://issues.apache.org/jira/browse/ACCUMULO-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316456#comment-15316456 ]
Michael Wall commented on ACCUMULO-4157: ---------------------------------------- I merged (but didn't push) my fix for ACCUMULO-4157 to 1.6 and then 1.7. The only difference in GarbageCollectWriteAheadLog.java between 1.6 and 1.7 was the addition of replication. But when I got to 1.8. that class has changed quite a bit. Mostly from Eric's ticket https://issues.apache.org/jira/browse/ACCUMULO-3423. I am trying to determine if the issue I was fixing is still an issue in 1.8, where the GCWAL simply deletes a WALOG for a downed tserver. I see this code is still present, which was the problem in 1.6. See https://github.com/apache/accumulo/blob/1.8/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L241 {code} try { if (!useTrash || !fs.moveToTrash(path)) { fs.deleteRecursively(path); } status.currentLog.deleted++; } catch (FileNotFoundException ex) { // ignored } catch (IOException ex) { log.error("Unable to delete wal " + path + ": " + ex); } {code} But in 1.6 and 1.7, that was only called for a WAL from a "dead" tserver. Still working through what gets to this code. Anyone have any thoughts on this? [~elserj] [~kturner] > WAL can be prematurely deleted > ------------------------------ > > Key: ACCUMULO-4157 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4157 > Project: Accumulo > Issue Type: Bug > Components: gc > Affects Versions: 1.6.5, 1.7.1 > Reporter: Michael Wall > Assignee: Michael Wall > Priority: Blocker > Fix For: 1.6.6, 1.7.2, 1.8.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Ran into a situation where the master started logging [Unable to initiate log > sort|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103] > because the WAL could not be found right after a tserver died. The WAL was > for a tablet in the metadata table, with a key extent like > *!0;t;endrow;prevrow* hosted by a tabletserver like *tserver1*. Doing a log > sort happens in the master before a tablet can be reassigned and brought back > online. Accumulo is in a really bad state when this happens as that tablet > will stay unhosted until manual intervention. Luckily, the WAL was in the > HDFS trash and could be copied back to the expected location. > Piecing together the logs showed something like the following > * 13:30:36 *tserver1* minor compacted for that extent [MinC finish lock 0.00 > secs > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L938], > maybe removing the log entry from the metadata table > * 13:37:01 *tserver1* data written, [Adding 1 logs for extent > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131], > maybe adding back that log entry to the metadata table > * 13:38:58 *master* thinks the server went down, [Lost servers > tserver1:9997|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1199], > but it is really still up for another 8 mins > * 13:38:58 *master* identifies WAL to be recovered [Recovering > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L147] > * 13:38:58 *master* [Loaded class : > org.apache.accumulo.server.master.recovery.HadoopLogCloser|https://github.com/apache/accumulo/blob/rel/1.6.5/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L239] > * 13:38:58 *master* [Started recovery of > hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a > reference|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L181] > * 13:39:13 *master* [Waiting for file to be closed > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/base/src/main/java/org/apache/accumulo/server/master/recovery/HadoopLogCloser.java#L55] > * 13:39:16 *gc* cleaning up WAL doesn't see reference, thinks server is > down so wacks the WAL and logs [Removing WAL for offline server > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L194] > * 13:39:18 *master* gets FileNotFoundException for that WAL, [Unable to > initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID | > https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103]. > Accumulo is in a bad state from here until manual intervention. > * 13:44:16 *tserver1* more data written, [Adding 1 logs for extent > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131] > * 13:45:45 *tserver1* finally dies after logging [Lost table server lock > (reason = SESSION_EXPIRED) > |https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3219] > This suggests that the GargabeCollectWriteAheadLog is too aggressively > removing WALs for a server it thinks is dead but may actually still be doing > work. The tserver was under heavy load before it went down. > Studying the logs with [~kturner] and brainstorming, here are some things > that could be fixed/checked > # When gc doesn't see a reference to a WAL in metadata table, it asks the > tablet server to delete the log. The gc process then logs at DEBUG that the > WAL was deleted regardless of whether it was or not. Maybe change log to > "asking tserver to delete WAL" or something. We found these messages in the > gc log 45 minutes before this event. These messages were misleading because > further investigation shows the tserver will log [Deleting wal | > https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2607] > when a WAL is truly deleted. There were not such message in the tserver 45 > min earlier, indicating the WAL was not actually deleted. > # GC logs "Removing WAL for offline" at DEBUG. These can roll off pretty > quickly, so change that to INFO. This will help keep history around longer > to aid troubleshooting. > # Verify the "adding 1 logs for extent" is using the srv:lock column to > enforce the constraint. Looks like it is, but if zooLock is null in the > update of MetadataTableUtil maybe badness is happening. > # In GC, maybe keep a map of first time we see a tablet server is down and > don't actually remove the WAL for offline tablet servers until they have been > down an hour or something. Would need to make sure that map is cleared when > a tserver comes back online. -- This message was sent by Atlassian JIRA (v6.3.4#6332)