keith-turner opened a new issue, #5870:
URL: https://github.com/apache/accumulo/issues/5870

   **Describe the bug**
   
   While running the bulk random walk test, saw the following failure for the 
merge operation.
   
   ```
   2025-09-05T11:26:27,152 [thrift.ProcessFunction] ERROR: Internal error processing waitForFateOperation
   java.lang.IllegalStateException: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 merging tablet 27;r16b92< had location Location [server=localhost:10001[1000166bc23001e], type=CURRENT]
           at com.google.common.base.Preconditions.checkState(Preconditions.java:853) ~[guava-33.4.6-jre.jar:?]
           at org.apache.accumulo.manager.tableOps.merge.MergeTablets.validateTablet(MergeTablets.java:244) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.manager.tableOps.merge.MergeTablets.call(MergeTablets.java:90) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.manager.tableOps.merge.MergeTablets.call(MergeTablets.java:54) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.manager.tableOps.TraceRepo.call(TraceRepo.java:74) ~[accumulo-manager-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.core.fate.FateExecutor.executeCall(FateExecutor.java:602) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.execute(FateExecutor.java:486) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.run(FateExecutor.java:416) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
           at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
   ```
   
   
   The following are some events in the logs related to the fate operation and 
to writes of the future location, current location, and last location.  All 
events are from the manager log except for the ones with the tserver prefix.  
The Assigned event is when the future location is set.  
   
   ```
   2025-09-05T11:26:26,765 [fate.FateExecutor] DEBUG: Running TableRangeOp.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 1 ms and returned ReserveTablets
   2025-09-05T11:26:26,775 [merge.ReserveTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 reserving tablets in range 27;r0db2f;r035d4
   2025-09-05T11:26:26,777 [tablet.location] DEBUG: Assigned 27;r16b92< to localhost:10001[1000166bc23001e]
   tserver_default_2_localhost.log:2025-09-05T11:26:26,777 [tablet.location] DEBUG: Loading 27;r16b92< on localhost:10001[1000166bc23001e]
   tserver_default_2_localhost.log:2025-09-05T11:26:26,784 [tablet.location] DEBUG: Loaded 27;r16b92< on localhost:10001[1000166bc23001e]
   2025-09-05T11:26:26,788 [merge.ReserveTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 reserve tablets op:MERGE count:1 other opids:0 opids set:1 locations:0 accepted:1 wals:0
   2025-09-05T11:26:26,788 [fate.FateExecutor] DEBUG: Running ReserveTablets.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 13 ms and returned 0
   2025-09-05T11:26:26,788 [fate.FateExecutor] DEBUG: Running ReserveTablets.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned CountFiles
   2025-09-05T11:26:26,796 [fate.FateExecutor] DEBUG: Running CountFiles.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned 0
   2025-09-05T11:26:26,802 [merge.CountFiles] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 found 80 files in the merge range, maxFiles is 10000
   2025-09-05T11:26:26,802 [fate.FateExecutor] DEBUG: Running CountFiles.call() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 5 ms and returned MergeTablets
   2025-09-05T11:26:26,802 [metadata.ConditionalTabletsMutatorImpl] DEBUG: Mutation was rejected, status:REJECTED extent:27;r16b92< row:27;r16b92 operation description: null
   2025-09-05T11:26:26,808 [fate.FateExecutor] DEBUG: Running MergeTablets.isReady() FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 took 0 ms and returned 0
   2025-09-05T11:26:26,808 [merge.MergeTablets] DEBUG: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 Merging metadata for 27;r0db2f;r035d4
   2025-09-05T11:26:26,815 [fate.FateExecutor] WARN : Failed to execute Repo FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9
   java.lang.IllegalStateException: FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9 merging tablet 27;r16b92< had location Location [server=localhost:10001[1000166bc23001e], type=CURRENT]
   tserver_default_2_localhost.log:2025-09-05T11:26:26,818 [tablet.Tablet] INFO : Tablet 27;r16b92< closed.
   tserver_default_2_localhost.log:2025-09-05T11:26:26,823 [tablet.location] DEBUG: Unassigned 27;r16b92< with 0 walogs
   ```
   
   The operation id and future location were written around the same time. 
   
   
   The following is a scan of the metadata table, taken after the failure, that 
includes timestamps.  The timestamps help show the order in which columns were 
updated.  This shows the last location was set before the opid.
   
   ```
   27;r16b92 last:1000166bc23001e [] 1044216    localhost:10001
   27;r16b92 srv:dir [] 1043345 t-0008b1g
   27;r16b92 srv:lock [] 1044218        /tservers/default/localhost:10001/zlock#dd1ee18f-af38-44f8-9562-b52fe43db435#0000000000$1000166bc23001e
   27;r16b92 srv:opid [] 1044217        MERGING:FATE:USER:adaa71e2-f39e-4194-88f5-a2f71b9551d9
   27;r16b92 srv:time [] 1044208        M1757071586487
   27;r16b92 ~tab:availability [] 1044175       ONDEMAND
   27;r16b92 ~tab:mergeability [] 1044175       {"never":true}
   27;r16b92 ~tab:requestToHost [] 1044213
   27;r16b92 ~tab:~pr [] 1044175        \x00
   ```
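
To make the ordering argument concrete, the following is a small sketch that sorts a few of the columns from the scan by their timestamps. The class and method names are made up for illustration, and the sketch assumes, as the text above does, that a smaller timestamp means an earlier write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustration only: orders columns from the scan above by timestamp. */
public class ScanOrderSketch {

  /** Returns the column names sorted from the oldest write to the newest. */
  static List<String> columnsInWriteOrder() {
    // Column name and timestamp pairs copied from the metadata scan.
    List<Map.Entry<String, Long>> cols = new ArrayList<>(List.of(
        Map.entry("last", 1044216L),
        Map.entry("srv:lock", 1044218L),
        Map.entry("srv:opid", 1044217L),
        Map.entry("srv:time", 1044208L),
        Map.entry("~tab:requestToHost", 1044213L)));

    // Sort ascending by timestamp.
    cols.sort(Map.Entry.comparingByValue());

    List<String> names = new ArrayList<>();
    for (Map.Entry<String, Long> e : cols) {
      names.add(e.getKey());
    }
    return names;
  }

  public static void main(String[] args) {
    // prints [srv:time, ~tab:requestToHost, last, srv:opid, srv:lock]
    System.out.println(columnsInWriteOrder());
  }
}
```

The last location (1044216) sorts immediately before the opid (1044217), which is the ordering the scan is meant to show.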
   
   
   
   **To Reproduce**
   
   Left the bulk test running in a loop (changed the graph to start a new test 
when one completes) for a long time.
   
   **Expected behavior**
   
   Merge should never see a location on a tablet after it has set an opid on 
the tablet.
   
   Suspect this was caused because this code does not require an absent 
location.  
   
   
https://github.com/apache/accumulo/blob/0ae5f340f2a9575379e81bd8a582168056d19e55/server/manager/src/main/java/org/apache/accumulo/manager/tableOps/merge/ReserveTablets.java#L85
   
   This code should have a requireAbsentLocation() added.
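
The suspected race can be illustrated with a toy model of a single metadata row that accepts conditional writes. This is only a sketch and does not use Accumulo's classes: `ConditionalRowSketch`, `conditionalPut`, `replay`, and the `opid`/`loc` column names are all hypothetical. It also assumes that assignment is accepted whenever no opid is present yet, which is what the timestamps in the scan suggest happened.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of one tablet's metadata row. Hypothetical; not Accumulo's API. */
public class ConditionalRowSketch {
  private final Map<String, String> columns = new HashMap<>();

  /** Writes column=value only if every column in mustBeAbsent is unset. */
  public synchronized boolean conditionalPut(Set<String> mustBeAbsent, String column,
      String value) {
    for (String name : mustBeAbsent) {
      if (columns.containsKey(name)) {
        return false; // condition failed, mutation rejected
      }
    }
    columns.put(column, value);
    return true;
  }

  public synchronized boolean has(String column) {
    return columns.containsKey(column);
  }

  /**
   * Replays the suspected interleaving and reports whether the row ends up
   * with both an operation id and a location.
   */
  public static boolean replay(boolean reserveRequiresAbsentLocation) {
    ConditionalRowSketch row = new ConditionalRowSketch();

    // 1. The merge reads the row at this point and sees no location.
    //    That view is stale as soon as step 2 runs.

    // 2. Assignment sets the location. No opid exists yet, so it is accepted
    //    (assumed condition: no opid and no location present).
    row.conditionalPut(Set.of("opid", "loc"), "loc", "localhost:10001");

    // 3. The merge writes its opid based on the stale read from step 1.
    Set<String> mustBeAbsent =
        reserveRequiresAbsentLocation ? Set.of("opid", "loc") : Set.of("opid");
    row.conditionalPut(mustBeAbsent, "opid", "MERGING:FATE:USER:example");

    return row.has("opid") && row.has("loc");
  }

  public static void main(String[] args) {
    // prints true: checking only for an absent opid lets both columns coexist
    System.out.println(replay(false));
    // prints false: also requiring an absent location rejects the opid write
    System.out.println(replay(true));
  }
}
```

In this model, a rejected opid write means the reservation step would have to retry after re-reading the tablet, rather than proceeding with a location count that was taken before the assignment landed.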
   
   

