keith-turner commented on code in PR #2792:
URL: https://github.com/apache/accumulo/pull/2792#discussion_r977593844


##########
server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java:
##########
@@ -216,6 +224,46 @@ private long 
removeBlipCandidates(GarbageCollectionEnvironment gce,
     return blipCount;
   }
 
+  @VisibleForTesting
+  /**
+   * Double check no tables were missed during GC
+   */
+  protected void ensureAllTablesChecked(Set<TableId> tableIdsBefore, 
Set<TableId> tableIdsSeen,
+      Set<TableId> tableIdsAfter) {
+
+    // if a table was added or deleted during this run, it is acceptable to not
+    // have seen those tables ids when scanning the metadata table. So get the 
intersection
+    final Set<TableId> tableIdsMustHaveSeen = new HashSet<>(tableIdsBefore);
+    tableIdsMustHaveSeen.retainAll(tableIdsAfter);
+
+    if (tableIdsMustHaveSeen.isEmpty() && !tableIdsSeen.isEmpty()) {
+      throw new RuntimeException("Garbage collection will not proceed because "
+          + "table ids were seen in the metadata table and none were seen 
Zookeeper. "
+          + "This can have two causes. First, total number of tables going 
to/from "
+          + "zero during a GC cycle will cause this. Second, it could be 
caused by "
+          + "corruption of the metadata table and/or Zookeeper. Only the 
second cause "
+          + "is problematic, but there is no way to distinguish between the 
two causes "
+          + "so this GC cycle will not proceed. The first cause should be 
transient "
+          + "and one would not expect to see this message repeated in 
subsequent GC cycles.");
+    }

Review Comment:
   > We specifically have the initial GC delay shorter (30s) than the cycle 
delay (5m), which makes this more likely to occur if the first table they 
create is around the 30 second mark when the first GC cycle is starting to run.
   
   I was thinking through this situation wondering how probable and in the 
process determined that this particular situation is probably not possible.  
For this case there would be no GC candidates.  I wasn't sure what the garbage 
collector did when there are no candidates.  I looked and found the following 
code.
   
   
https://github.com/apache/accumulo/blob/3e1225f0db469f8fa76be7cc4e72961696b57a2f/server/gc/src/main/java/org/apache/accumulo/gc/GarbageCollectionAlgorithm.java#L314-L333
   
   Looking at that, if `candidatesIter` does not initially have a next then it 
will not scan for references and run the validation added in this PR.  I was 
happy to find the GC does no unnecessary work when there are no candidates.
   
   A user would need to at least do the following to possibly bump into this 
message.
   
    * create a table
    * initiate activity that creates gc candidates like major compactions
    * delete the table
   
   The delete would have to overlap in time with the GC reading tables ids from 
ZK and reading references from the metadata table.  For a system with hardly 
any tables or tablets reading them will be extremely fast making the 
probability of a one off user operation overlapping extremely small.  A user 
has to delete a table within a 5ms to 10ms time window that happens after 30s 
and then every 5 min to see this message.  So it seems unlikely someone would 
bump into this.  It would  be good to run through a situation like the above a 
few times and see what happens, I will give that a try.
   
   Where this message will likely be seen is on a system that has lots of 
tables and tablets (makes GC cycle take longer) and frequently goes to/from 
zero total tables.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to