teamconfx opened a new issue, #6043:
URL: https://github.com/apache/accumulo/issues/6043
**Describe the bug**
After a graceful tablet server restart, clients calling
`tableOperations().summaries(table).retrieve()` may receive an empty list even
when the table contains valid summary data that was flushed before the restart.
This is a transient race condition that causes intermittent failures (~40%
occurrence rate) when summary retrieval happens shortly after a tablet server
restart.
The root cause is in `Gatherer.countFiles()`, which queries the metadata
table for file counts. During the transient window after a tablet server
restart, this query may return 0 files even though files exist. When
`countFiles()` returns 0, the `gather()` method short-circuits and returns an
empty `SummaryCollection`, causing clients to receive empty results.
**Versions (OS, Maven, Java, and others, as appropriate):**
- Affected version(s): Accumulo 2.1.x
- OS: Linux (tested on Ubuntu 20.04)
- Java: OpenJDK 17
**To Reproduce**
1. Create a table with summarization enabled:
```java
SummarizerConfiguration sc =
    SummarizerConfiguration.builder(MySummarizer.class).build();
NewTableConfiguration ntc =
    new NewTableConfiguration().enableSummarization(sc);
client.tableOperations().create("testTable", ntc);
```
2. Write data and flush to generate summaries:
```java
try (BatchWriter bw = client.createBatchWriter("testTable")) {
  // write some data
}
client.tableOperations().flush("testTable", null, null, true);
```
3. Verify summaries exist:
```java
List<Summary> summaries =
    client.tableOperations().summaries("testTable").retrieve();
// summaries.size() == 1, as expected
```
4. Gracefully restart a tablet server in the cluster
5. Immediately retrieve summaries again:
```java
List<Summary> summaries =
    client.tableOperations().summaries("testTable").retrieve();
// summaries.size() == 0 (UNEXPECTED - should be 1)
summaries.get(0); // throws IndexOutOfBoundsException
```
**Expected behavior**
Summary retrieval should return the correct summaries after a tablet server
restart. The data was flushed before the restart, so summaries should be
available. The `retrieve()` method should return a non-empty list containing
the table's summary data.
**Additional context**
**Buggy Code Location:**
`core/src/main/java/org/apache/accumulo/core/summary/Gatherer.java`
```java
// Lines 449-452
private int countFiles() {
  // TODO use a batch scanner + iterator to parallelize counting files
  return TabletsMetadata.builder(ctx).forTable(tableId).overlapping(startRow, endRow)
      .fetch(FILES, PREV_ROW).build().stream().mapToInt(tm -> tm.getFiles().size()).sum();
}

// Lines 495-502
public Future<SummaryCollection> gather(ExecutorService es) {
  int numFiles = countFiles();
  if (numFiles == 0) {
    // Short-circuits here!
    return CompletableFuture.completedFuture(new SummaryCollection());
  }
  // ... rest of method
}
```
**Race Condition Mechanism:**
1. Client calls `summaries(table).retrieve()`
2. A random tablet server is selected to coordinate the summary retrieval
3. The coordinating tablet server creates a `Gatherer` and calls `gather()`
4. `gather()` calls `countFiles()` to count files in the metadata table
5. After a tablet server restart, while tablets are being reassigned,
`countFiles()` may transiently return 0
6. `gather()` returns an empty `SummaryCollection` without actually
gathering summaries
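
The steps above can be modeled in isolation. The following is a toy sketch, not Accumulo code: `GatherRaceModel`, `flakyCounter`, and the string-based summaries are hypothetical stand-ins, and the only assumption is that the file count reads 0 for a fixed number of calls before reporting the real value.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.IntSupplier;

/** Toy model of the short-circuit in gather(); not Accumulo code. */
final class GatherRaceModel {

  /** Hypothetical stand-in for countFiles(): reports 0 for the first few calls. */
  static IntSupplier flakyCounter(int transientZeroCalls, int realFileCount) {
    int[] calls = {0};
    return () -> calls[0]++ < transientZeroCalls ? 0 : realFileCount;
  }

  /** Mirrors the shape of gather(): a count of 0 skips the real work entirely. */
  static List<String> gather(IntSupplier countFiles, List<String> storedSummaries) {
    if (countFiles.getAsInt() == 0) {
      return Collections.emptyList(); // short-circuit: summaries are never read
    }
    return storedSummaries;
  }

  public static void main(String[] args) {
    IntSupplier counter = flakyCounter(1, 4);
    List<String> stored = List.of("summary-for-testTable");
    System.out.println(gather(counter, stored).size()); // prints 0 (inside the window)
    System.out.println(gather(counter, stored).size()); // prints 1 (after the window)
  }
}
```

The stored summaries are identical across both calls; only the file count differs, which is why the empty result looks like data loss to the client even though nothing was lost.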
**Proposed Fix:**
Add retry logic to `countFiles()` to handle transient metadata
inconsistencies:
```java
private int countFiles() {
  int maxRetries = 3;
  int retryDelayMs = 100;
  for (int attempt = 0; attempt < maxRetries; attempt++) {
    int count = TabletsMetadata.builder(ctx).forTable(tableId).overlapping(startRow, endRow)
        .fetch(FILES, PREV_ROW).build().stream().mapToInt(tm -> tm.getFiles().size()).sum();
    if (count > 0 || attempt == maxRetries - 1) {
      return count;
    }
    try {
      Thread.sleep(retryDelayMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return 0;
    }
  }
  return 0;
}
```
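
To exercise the retry behavior without a cluster, the loop can be separated from the metadata query. This is a minimal sketch under that assumption: `CountRetrySketch` and `countWithRetry` are hypothetical names, and the `IntSupplier` stands in for the `TabletsMetadata` query.

```java
import java.util.function.IntSupplier;

/** Sketch of the retry loop with the counter injected; names are hypothetical. */
final class CountRetrySketch {

  /**
   * Calls counter up to maxAttempts times, returning the first non-zero result,
   * or 0 if every attempt returned 0 or the thread was interrupted.
   */
  static int countWithRetry(IntSupplier counter, int maxAttempts, long delayMs) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      int count = counter.getAsInt();
      if (count > 0 || attempt == maxAttempts) {
        return count;
      }
      try {
        Thread.sleep(delayMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve interrupt status for callers
        return 0;
      }
    }
    return 0; // reached only when maxAttempts < 1
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Transient case: two zero reads, then the real count.
    int transientResult = countWithRetry(() -> calls[0]++ < 2 ? 0 : 7, 3, 10);
    System.out.println(transientResult + " after " + calls[0] + " calls"); // 7 after 3 calls

    calls[0] = 0;
    // Genuinely empty range: every attempt is used before 0 is returned.
    int emptyResult = countWithRetry(() -> { calls[0]++; return 0; }, 3, 10);
    System.out.println(emptyResult + " after " + calls[0] + " calls"); // 0 after 3 calls
  }
}
```

One trade-off this makes visible: a range that really has no files cannot be distinguished from the transient case, so it always pays the full `(maxAttempts - 1) * delayMs` wait before returning 0.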
**Workaround:**
Until this is fixed, applications can implement client-side retry:
```java
// Thread.sleep throws InterruptedException; the enclosing method must declare or handle it.
List<Summary> summaries = Collections.emptyList();
for (int i = 0; i < 3 && summaries.isEmpty(); i++) {
  summaries = client.tableOperations().summaries(tableName).retrieve();
  if (summaries.isEmpty()) {
    Thread.sleep(100);
  }
}
```