teamconfx opened a new issue, #6043:
URL: https://github.com/apache/accumulo/issues/6043

   **Describe the bug**
   
   After a graceful tablet server restart, clients calling 
`tableOperations().summaries(table).retrieve()` may receive an empty list even 
when the table contains valid summary data that was flushed before the restart. 
This is a transient race condition that causes intermittent failures (~40% 
occurrence rate) when summary retrieval happens shortly after a tablet server 
restart.
   
   The root cause is in `Gatherer.countFiles()`, which queries the metadata 
table for file counts. During the transient window after a tablet server 
restart, this query may return 0 files even though files exist. When 
`countFiles()` returns 0, the `gather()` method short-circuits and returns an 
empty `SummaryCollection`, causing clients to receive empty results.
   
   **Versions (OS, Maven, Java, and others, as appropriate):**
   - Affected version(s): Accumulo 2.1.x
   - OS: Linux (tested on Ubuntu 20.04)
   - Java: OpenJDK 17
   
   **To Reproduce**
   
   1. Create a table with summarization enabled:
      ```java
      SummarizerConfiguration sc =
          SummarizerConfiguration.builder(MySummarizer.class).build();
      NewTableConfiguration ntc =
          new NewTableConfiguration().enableSummarization(sc);
      client.tableOperations().create("testTable", ntc);
      ```
   
   2. Write data and flush to generate summaries:
      ```java
      try (BatchWriter bw = client.createBatchWriter("testTable")) {
          // write some data
      }
      client.tableOperations().flush("testTable", null, null, true);
      ```
   
   3. Verify summaries exist:
      ```java
      List<Summary> summaries =
          client.tableOperations().summaries("testTable").retrieve();
      // summaries.size() == 1, as expected
      ```
   
   4. Gracefully restart a tablet server in the cluster
   
   5. Immediately retrieve summaries again:
      ```java
      List<Summary> summaries =
          client.tableOperations().summaries("testTable").retrieve();
      // summaries.size() == 0 (UNEXPECTED - should be 1)
      summaries.get(0); // throws IndexOutOfBoundsException
      ```
   
   **Expected behavior**
   
   Summary retrieval should return the correct summaries after a tablet server 
restart. The data was flushed before the restart, so summaries should be 
available. The `retrieve()` method should return a non-empty list containing 
the table's summary data.
   
   **Additional context**
   
   **Buggy Code Location:** 
`core/src/main/java/org/apache/accumulo/core/summary/Gatherer.java`
   
   ```java
   // Lines 449-452
   private int countFiles() {
     // TODO use a batch scanner + iterator to parallelize counting files
     return TabletsMetadata.builder(ctx).forTable(tableId).overlapping(startRow, endRow)
         .fetch(FILES, PREV_ROW).build().stream().mapToInt(tm -> tm.getFiles().size()).sum();
   }
   
   // Lines 495-502
   public Future<SummaryCollection> gather(ExecutorService es) {
     int numFiles = countFiles();
   
     if (numFiles == 0) {
       // Short-circuits here!
       return CompletableFuture.completedFuture(new SummaryCollection());
     }
     // ... rest of method
   }
   ```
   
   **Race Condition Mechanism:**
   1. Client calls `summaries(table).retrieve()`
   2. A random tablet server is selected to coordinate the summary retrieval
   3. The coordinating tablet server creates a `Gatherer` and calls `gather()`
   4. `gather()` calls `countFiles()` to count files in the metadata table
   5. After tablet server restart, during tablet reassignment, `countFiles()` 
may transiently return 0
   6. `gather()` returns an empty `SummaryCollection` without actually 
gathering summaries
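
The effect of steps 4 to 6 can be illustrated with a toy model. This is not Accumulo code: `ShortCircuitDemo`, its `gather` method, and the string "summaries" are hypothetical stand-ins that only mirror the shape of the early return, assuming the file count comes from an injectable supplier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Toy model (hypothetical names, not Accumulo classes): shows how a
// transient zero file count hides summaries that are actually stored.
public final class ShortCircuitDemo {
  private ShortCircuitDemo() {}

  public static List<String> gather(IntSupplier countFiles, List<String> storedSummaries) {
    if (countFiles.getAsInt() == 0) {
      // Early return: the stored summaries are never consulted.
      return new ArrayList<>();
    }
    return new ArrayList<>(storedSummaries);
  }
}
```

With a counter that reports 0, `gather` returns an empty list even though `storedSummaries` is non-empty, which matches what the client observes in step 5 of the reproduction.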
   
   **Proposed Fix:**
   
   Add retry logic to `countFiles()` to handle transient metadata 
inconsistencies:
   
   ```java
   private int countFiles() {
     int maxRetries = 3;
     int retryDelayMs = 100;
   
     for (int attempt = 0; attempt < maxRetries; attempt++) {
       int count = TabletsMetadata.builder(ctx).forTable(tableId).overlapping(startRow, endRow)
           .fetch(FILES, PREV_ROW).build().stream().mapToInt(tm -> tm.getFiles().size()).sum();
   
       if (count > 0 || attempt == maxRetries - 1) {
         return count;
       }
   
       try {
         Thread.sleep(retryDelayMs);
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         return 0;
       }
     }
     return 0;
   }
   ```
   
   **Workaround:**
   
   Until this is fixed, applications can implement client-side retry:
   
   ```java
   List<Summary> summaries = Collections.emptyList();
   for (int i = 0; i < 3 && summaries.isEmpty(); i++) {
     summaries = client.tableOperations().summaries(tableName).retrieve();
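     if (summaries.isEmpty() && i < 2) {
       // Skip the sleep after the final attempt. Thread.sleep throws
       // InterruptedException, which the caller must declare or handle.
       Thread.sleep(100);
     }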
   }
   ```
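
The same workaround could be packaged as a small reusable helper. The following is a minimal sketch, not part of Accumulo: `SummaryRetry` and `retrieveWithRetry` are hypothetical names, and the attempt count and delay are assumptions the caller would tune. It takes a `Callable` so that the checked exceptions thrown by `retrieve()` pass through unchanged.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical client-side helper (not an Accumulo API).
public final class SummaryRetry {
  private SummaryRetry() {}

  // Calls fetch up to maxAttempts times and returns the first non-empty
  // result, or the last (empty) result if every attempt came back empty.
  public static <T> List<T> retrieveWithRetry(Callable<List<T>> fetch, int maxAttempts,
      long delayMs) throws Exception {
    List<T> result = Collections.emptyList();
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      result = fetch.call();
      if (!result.isEmpty() || attempt == maxAttempts) {
        return result;
      }
      // Wait before the next attempt; an interrupt propagates to the caller.
      Thread.sleep(delayMs);
    }
    return result;
  }
}
```

A call might look like `SummaryRetry.retrieveWithRetry(() -> client.tableOperations().summaries(tableName).retrieve(), 3, 100)`. One trade-off to keep in mind: a table that genuinely has no summaries pays the full retry delay before the empty list is returned.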

