keith-turner opened a new issue, #5870:
URL: https://github.com/apache/accumulo/issues/5870
**Describe the bug**
While running the bulk random walk test, I saw the following failure in the merge operation.
```
2025-09-05T11:26:27,152 [thrift.ProcessFunction] ERROR: Internal error processing waitForFateOperation
java.lang.IllegalStateException: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 merging tablet 27;r16b92< had location Location [server=localhost:10001[1000166bc23001e], type=CURRENT]
	at com.google.common.base.Preconditions.checkState(Preconditions.java:853) ~[guava-33.4.6-jre.jar:?]
	at org.apache.accumulo.manager.tableOps.merge.MergeTablets.validateTablet(MergeTablets.java:244) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.manager.tableOps.merge.MergeTablets.call(MergeTablets.java:90) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.manager.tableOps.merge.MergeTablets.call(MergeTablets.java:54) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.manager.tableOps.TraceRepo.call(TraceRepo.java:74) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.core.fate.FateExecutor.executeCall(FateExecutor.java:602) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.execute(FateExecutor.java:486) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.run(FateExecutor.java:416) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
```
The following are log events related to the FATE operation and to the writing of the future, current, and last locations. All events are from the manager log except those with the `tserver_` prefix, which come from the tablet server log. The Assigned event is when the future location is set.
```
2025-09-05T11:26:26,765 [fate.FateExecutor] DEBUG: Running TableRangeOp.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 1 ms and returned ReserveTablets
2025-09-05T11:26:26,775 [merge.ReserveTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 reserving tablets in range 27;r0db2f;r035d4
2025-09-05T11:26:26,777 [tablet.location] DEBUG: Assigned 27;r16b92< to localhost:10001[1000166bc23001e]
tserver_default_2_localhost.log:2025-09-05T11:26:26,777 [tablet.location] DEBUG: Loading 27;r16b92< on localhost:10001[1000166bc23001e]
tserver_default_2_localhost.log:2025-09-05T11:26:26,784 [tablet.location] DEBUG: Loaded 27;r16b92< on localhost:10001[1000166bc23001e]
2025-09-05T11:26:26,788 [merge.ReserveTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 reserve tablets op:MERGE count:1 other opids:0 opids set:1 locations:0 accepted:1 wals:0
2025-09-05T11:26:26,788 [fate.FateExecutor] DEBUG: Running ReserveTablets.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 13 ms and returned 0
2025-09-05T11:26:26,788 [fate.FateExecutor] DEBUG: Running ReserveTablets.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned CountFiles
2025-09-05T11:26:26,796 [fate.FateExecutor] DEBUG: Running CountFiles.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned 0
2025-09-05T11:26:26,802 [merge.CountFiles] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 found 80 files in the merge range, maxFiles is 10000
2025-09-05T11:26:26,802 [fate.FateExecutor] DEBUG: Running CountFiles.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 5 ms and returned MergeTablets
2025-09-05T11:26:26,802 [metadata.ConditionalTabletsMutatorImpl] DEBUG: Mutation was rejected, status:REJECTED extent:27;r16b92< row:27;r16b92 operation description: null
2025-09-05T11:26:26,808 [fate.FateExecutor] DEBUG: Running MergeTablets.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned 0
2025-09-05T11:26:26,808 [merge.MergeTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 Merging metadata for 27;r0db2f;r035d4
2025-09-05T11:26:26,815 [fate.FateExecutor] WARN : Failed to execute Repo FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9
java.lang.IllegalStateException: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 merging tablet 27;r16b92< had location Location [server=localhost:10001[1000166bc23001e], type=CURRENT]
tserver_default_2_localhost.log:2025-09-05T11:26:26,818 [tablet.Tablet] INFO : Tablet 27;r16b92< closed.
tserver_default_2_localhost.log:2025-09-05T11:26:26,823 [tablet.location] DEBUG: Unassigned 27;r16b92< with 0 walogs
```
The operation id and future location were written around the same time.
The following is a scan of the metadata table, taken after the failure, that includes timestamps. The timestamps show the order in which the columns were updated: the last location (1044216) was set before the opid (1044217).
```
27;r16b92 last:1000166bc23001e [] 1044216 localhost:10001
27;r16b92 srv:dir [] 1043345 t-0008b1g
27;r16b92 srv:lock [] 1044218 /tservers/default/localhost:10001/zlock#dd1ee18f-af38-44f8-9562-b52fe43db435#0000000000$1000166bc23001e
27;r16b92 srv:opid [] 1044217 MERGING:FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9
27;r16b92 srv:time [] 1044208 M1757071586487
27;r16b92 ~tab:availability [] 1044175 ONDEMAND
27;r16b92 ~tab:mergeability [] 1044175 {"never":true}
27;r16b92 ~tab:requestToHost [] 1044213
27;r16b92 ~tab:~pr [] 1044175 \x00
```
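To make the ordering argument concrete, the following is a minimal sketch that sorts a subset of the columns from the scan of `27;r16b92` by their timestamps. The `MetadataUpdateOrder` class and its `Entry` record are hypothetical scaffolding, not an Accumulo API; the timestamps are copied from the scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MetadataUpdateOrder {

  // Hypothetical holder for one metadata column and the timestamp shown in the scan.
  record Entry(String column, long timestamp) {}

  // Returns the column names ordered from oldest to newest write.
  static List<String> orderByTimestamp(List<Entry> entries) {
    List<Entry> sorted = new ArrayList<>(entries);
    sorted.sort(Comparator.comparingLong(Entry::timestamp));
    List<String> columns = new ArrayList<>();
    for (Entry e : sorted) {
      columns.add(e.column());
    }
    return columns;
  }

  public static void main(String[] args) {
    // Timestamps copied from the metadata scan of tablet 27;r16b92.
    List<Entry> scan = List.of(
        new Entry("last", 1044216),
        new Entry("srv:dir", 1043345),
        new Entry("srv:lock", 1044218),
        new Entry("srv:opid", 1044217),
        new Entry("srv:time", 1044208),
        new Entry("~tab:availability", 1044175),
        new Entry("~tab:requestToHost", 1044213));
    System.out.println(orderByTimestamp(scan));
  }
}
```

Running it prints `[srv:dir, ~tab:availability, srv:time, ~tab:requestToHost, last, srv:opid, srv:lock]`, which places the `last` write immediately before `srv:opid`, with `srv:lock` written just after.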
**To Reproduce**
I left the bulk test running in a loop for a long time (I changed the graph to start a new test when one completes).
**Expected behavior**
Merge should never see a location on a tablet after it has set an opid on the tablet.
I suspect this was caused by the following code not requiring an absent location when it sets the opid:
https://github.com/apache/accumulo/blob/0ae5f340f2a9575379e81bd8a582168056d19e55/server/manager/src/main/java/org/apache/accumulo/manager/tableOps/merge/ReserveTablets.java#L85
This code should have a requireAbsentLocation() added.
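The following is a minimal, self-contained sketch of why an extra absent-location condition matters. It does not use Accumulo's Ample API: `TabletMeta`, `ConditionalMutation`, and `reserve` are hypothetical stand-ins that model a conditional (compare-and-set) write to a tablet's metadata row, with method names chosen to mirror the conditions discussed in this issue. The scenario assumes the suspected interleaving: a future location is written after the reserve step has decided to set an opid but before its write is applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ReserveTabletsSketch {

  // Hypothetical snapshot of a tablet's metadata row.
  static final class TabletMeta {
    String location; // future or current location; null when unhosted
    String opid;     // operation id; null when no operation has reserved the tablet
  }

  // Hypothetical conditional mutation: every condition must hold or the write is rejected.
  static final class ConditionalMutation {
    private final List<Predicate<TabletMeta>> conditions = new ArrayList<>();
    private String opidToPut;

    ConditionalMutation requireAbsentOperation() {
      conditions.add(tm -> tm.opid == null);
      return this;
    }

    ConditionalMutation requireAbsentLocation() {
      conditions.add(tm -> tm.location == null);
      return this;
    }

    ConditionalMutation putOperation(String opid) {
      this.opidToPut = opid;
      return this;
    }

    // Applies the write only if all conditions hold; returns true when accepted.
    // (Single-threaded model: the real system evaluates conditions atomically server side.)
    boolean submit(TabletMeta tm) {
      for (Predicate<TabletMeta> condition : conditions) {
        if (!condition.test(tm)) {
          return false; // REJECTED
        }
      }
      tm.opid = opidToPut;
      return true; // ACCEPTED
    }
  }

  // Replays the suspected race and reports whether the opid write was accepted.
  static boolean reserve(boolean requireAbsentLocation) {
    TabletMeta tablet = new TabletMeta();
    // Assignment writes a future location before the reserve mutation is applied.
    tablet.location = "localhost:10001[1000166bc23001e]";
    ConditionalMutation m = new ConditionalMutation().requireAbsentOperation();
    if (requireAbsentLocation) {
      m.requireAbsentLocation();
    }
    return m.putOperation("MERGING:FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9")
        .submit(tablet);
  }

  public static void main(String[] args) {
    System.out.println("absent operation only: accepted=" + reserve(false));
    System.out.println("absent operation and location: accepted=" + reserve(true));
  }
}
```

In this model the first call prints `accepted=true` (the opid lands on a tablet that already has a location, which is what later trips the check in `validateTablet`) and the second prints `accepted=false`, so the reservation would be retried instead of proceeding to merge.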