[ https://issues.apache.org/jira/browse/ACCUMULO-4157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated ACCUMULO-4157: ---------------------------------- Priority: Critical (was: Major) Shouldn't recovering the file have included doing an HDFS lease recovery, which prevents tserver1 from writing more data? > WAL can be prematurely deleted > ------------------------------ > > Key: ACCUMULO-4157 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4157 > Project: Accumulo > Issue Type: Bug > Components: gc > Affects Versions: 1.6.5, 1.7.1 > Reporter: Michael Wall > Assignee: Michael Wall > Priority: Critical > Fix For: 1.6.6, 1.7.2, 1.8.0 > > > Ran into a situation where the master started logging [Unable to initiate log > sort|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103] > because the WAL could not be found right after a tserver died. The WAL was > for a tablet in the metadata table, with a key extent like > *!0;t;endrow;prevrow* hosted by a tabletserver like *tserver1*. Doing a log > sort happens in the master before a tablet can be reassigned and brought back > online. Accumulo is in a really bad state when this happens as that tablet > will stay unhosted until manual intervention. Luckily, the WAL was in the > HDFS trash and could be copied back to the expected location. > Piecing together the logs showed something like the following > * 13:30:36 *tserver1* minor compacted for that extent [MinC finish lock 0.00 > secs > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L938], > maybe removing the log entry from the metadata table > * 13:37:01 *tserver1* data written, [Adding 1 logs for extent > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131], > maybe adding back that log entry to the metadata table > * 13:38:58 *master* thinks the server went down, [Lost servers > tserver1:9997|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/Master.java#L1199], > but it is really still up for another 8 mins > * 13:38:58 *master* identifies WAL to be recovered [Recovering > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L147] > * 13:38:58 *master* [Loaded class : > org.apache.accumulo.server.master.recovery.HadoopLogCloser|https://github.com/apache/accumulo/blob/rel/1.6.5/core/src/main/java/org/apache/accumulo/core/conf/AccumuloConfiguration.java#L239] > * 13:38:58 *master* [Started recovery of > hdfs://accumulo/wal/tserver1+9997/UUID, tablet !0;t;endrow;prevrow holds a > reference|https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L181] > * 13:39:13 *master* [Waiting for file to be closed > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/base/src/main/java/org/apache/accumulo/server/master/recovery/HadoopLogCloser.java#L55] > * 13:39:16 *gc* cleaning up WAL doesn't see reference, thinks server is > down so wacks the WAL and logs [Removing WAL for offline server > hdfs://accumulo/wal/tserver1+9997/UUID|https://github.com/apache/accumulo/blob/rel/1.6.5/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectWriteAheadLogs.java#L194] > * 13:39:18 *master* gets FileNotFoundException for that WAL, [Unable to > initate log sort for hdfs://accumulo/wal/tserver1+9997/UUID | > https://github.com/apache/accumulo/blob/rel/1.6.5/server/master/src/main/java/org/apache/accumulo/master/recovery/RecoveryManager.java#L103]. > Accumulo is in a bad state from here until manual intervention. > * 13:44:16 *tserver1* more data written, [Adding 1 logs for extent > !0;t;endrow;prevrow|https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3131] > * 13:45:45 *tserver1* finally dies after logging [Lost table server lock > (reason = SESSION_EXPIRED) > |https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L3219] > This suggests that the GargabeCollectWriteAheadLog is too aggressively > removing WALs for a server it thinks is dead but may actually still be doing > work. The tserver was under heavy load before it went down. > Studying the logs with [~kturner] and brainstorming, here are some things > that could be fixed/checked > # When gc doesn't see a reference to a WAL in metadata table, it asks the > tablet server to delete the log. The gc process then logs at DEBUG that the > WAL was deleted regardless of whether it was or not. Maybe change log to > "asking tserver to delete WAL" or something. We found these messages in the > gc log 45 minutes before this event. These messages were misleading because > further investigation shows the tserver will log [Deleting wal | > https://github.com/apache/accumulo/blob/rel/1.6.5/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2607] > when a WAL is truly deleted. There were not such message in the tserver 45 > min earlier, indicating the WAL was not actually deleted. > # GC logs "Removing WAL for offline" at DEBUG. These can roll off pretty > quickly, so change that to INFO. This will help keep history around longer > to aid troubleshooting. > # Verify the "adding 1 logs for extent" is using the srv:lock column to > enforce the constraint. Looks like it is, but if zooLock is null in the > update of MetadataTableUtil maybe badness is happening. > # In GC, maybe keep a map of first time we see a tablet server is down and > don't actually remove the WAL for offline tablet servers until they have been > down an hour or something. Would need to make sure that map is cleared when > a tserver comes back online. -- This message was sent by Atlassian JIRA (v6.3.4#6332)