keith-turner commented on PR #5707:
URL: https://github.com/apache/accumulo/pull/5707#issuecomment-3029095785
Here are some notes on what I did to track this bug down. Started by
running fate list and kept seeing these compactions not moving.
```
TABLE_BULK_IMPORT2 txid: 10ca761c03d9876d status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:00.758Z
TABLE_BULK_IMPORT2 txid: 67be5b2616e8163c status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:03.157Z
TABLE_COMPACT txid: 2f2ff4546d9052e5 status: IN_PROGRESS locked:
[R:+default, R:4] locking: [] op: CompactionDriver created:
2025-07-01T22:55:29.987Z
TABLE_BULK_IMPORT2 txid: 6102ab68e0ae9b54 status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:09.768Z
TABLE_MERGE txid: 188da098b244da57 status: SUBMITTED locked:
[R:+default] locking: [W:4] op: TableRangeOp created:
2025-07-01T22:55:59.759Z
TABLE_BULK_IMPORT2 txid: 545753cd7ac69fd2 status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:03.246Z
TABLE_BULK_IMPORT2 txid: 1f12560b9db5834c status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:01.417Z
TABLE_COMPACT txid: 5481217c586dcaf6 status: IN_PROGRESS locked:
[R:+default, R:4] locking: [] op: CompactionDriver created:
2025-07-01T22:55:22.614Z
```
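To make a listing like the one above easier to read, the entries can be split into transactions that hold locks and transactions that are waiting on them. This is only a minimal sketch in Python, assuming the listing has been saved as text in the format shown above; `summarize_fate` is a hypothetical helper, not part of Accumulo.

```python
import re

# Entry format taken from the fate list output above. Email wrapping is
# undone first by collapsing all whitespace to single spaces.
_ENTRY = re.compile(
    r"(?P<type>\w+) txid: (?P<txid>[0-9a-f]+) status: (?P<status>\w+) "
    r"locked: \[(?P<locked>[^\]]*)\] locking: \[(?P<locking>[^\]]*)\]"
)

def summarize_fate(listing):
    """Return (holders, waiters): lists of (type, txid, locks) tuples."""
    flat = " ".join(listing.split())
    holders, waiters = [], []
    for m in _ENTRY.finditer(flat):
        entry = (m["type"], m["txid"])
        if m["locked"]:
            holders.append(entry + (m["locked"],))
        if m["locking"]:
            waiters.append(entry + (m["locking"],))
    return holders, waiters

# Three entries copied from the listing above.
sample = """
TABLE_COMPACT txid: 2f2ff4546d9052e5 status: IN_PROGRESS locked:
[R:+default, R:4] locking: [] op: CompactionDriver created:
2025-07-01T22:55:29.987Z
TABLE_MERGE txid: 188da098b244da57 status: SUBMITTED locked:
[R:+default] locking: [W:4] op: TableRangeOp created:
2025-07-01T22:55:59.759Z
TABLE_BULK_IMPORT2 txid: 10ca761c03d9876d status: SUBMITTED
locked: [] locking: [R:4] op: PrepBulkImport created:
2025-07-01T22:56:00.758Z
"""

holders, waiters = summarize_fate(sample)
for h in holders:
    print("holds", h)
for w in waiters:
    print("waits", w)
```

For the sample, the compaction shows up as holding `R:4`, while the merge and the bulk import show up as waiting on table 4.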
Enabled trace logging in the manager and got some info that indicated that
both fate ops were waiting on a single tablet.
```
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE:
FATE[2f2ff4546d9052e5] tablets compacted:33/34 servers contacted:1 expected
id:49 compaction extent:4;r165f8;r0786c sleepTime:500
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE:
FATE[5481217c586dcaf6] tablets compacted:56/57 servers contacted:1 expected
id:48 compaction extent:4<;r003f8 sleepTime:500
```
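The same information can be pulled out of a larger manager log mechanically. A rough sketch in Python, assuming the CompactionDriver trace lines keep the format shown above; `pending_compactions` is a hypothetical name used only for illustration.

```python
import re

# Pattern based on the CompactionDriver trace lines above.
_TRACE = re.compile(
    r"FATE\[(?P<txid>[0-9a-f]+)\] tablets compacted:(?P<done>\d+)/(?P<total>\d+)"
    r".*?expected id:(?P<expected>\d+)"
)

def pending_compactions(log_text):
    """Return (txid, tablets_remaining, expected_id) for unfinished ops."""
    # Collapse whitespace so records wrapped across lines still match.
    flat = " ".join(log_text.split())
    pending = []
    for m in _TRACE.finditer(flat):
        done, total = int(m["done"]), int(m["total"])
        if done < total:
            pending.append((m["txid"], total - done, int(m["expected"])))
    return pending

# The two trace records from above, wrapped as they appeared.
trace = """
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE:
FATE[2f2ff4546d9052e5] tablets compacted:33/34 servers contacted:1 expected
id:49 compaction extent:4;r165f8;r0786c sleepTime:500
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE:
FATE[5481217c586dcaf6] tablets compacted:56/57 servers contacted:1 expected
id:48 compaction extent:4<;r003f8 sleepTime:500
"""

result = pending_compactions(trace)
print(result)
# [('2f2ff4546d9052e5', 1, 49), ('5481217c586dcaf6', 1, 48)]
```

Both ops report exactly one tablet remaining, which matches the observation that they were waiting on a single tablet.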
Looked in the metadata table and found a tablet in the range whose compact
id was lower than the expected ids above.
```
4;r13e11 srv:compact [] 50
4;r16d06 srv:compact [] 46
4;r172ae srv:compact [] 50
4;r172da srv:compact [] 50
```
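Spotting the lagging tablet by eye works for four rows but not for a large table. A minimal sketch in Python of the same comparison, assuming the metadata scan output is saved as text with lines of the form `<extent> srv:compact [] <id>` as above; `lagging_tablets` is a hypothetical helper, not an Accumulo API.

```python
import re

# Matches lines like "4;r16d06 srv:compact [] 46" (format from the scan above).
_COMPACT_LINE = re.compile(r"^(\S+)\s+srv:compact\s+\[\]\s+(\d+)\s*$")

def lagging_tablets(scan_output, expected_id):
    """Return (extent, compact_id) pairs whose compact id is below expected_id."""
    lagging = []
    for line in scan_output.splitlines():
        m = _COMPACT_LINE.match(line.strip())
        if m and int(m.group(2)) < expected_id:
            lagging.append((m.group(1), int(m.group(2))))
    return lagging

scan = """\
4;r13e11 srv:compact [] 50
4;r16d06 srv:compact [] 46
4;r172ae srv:compact [] 50
4;r172da srv:compact [] 50
"""

# 49 is the expected id reported for FATE[2f2ff4546d9052e5] in the trace log.
print(lagging_tablets(scan, 49))
# [('4;r16d06', 46)]
```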
Enabled trace logging on the tablet server and saw the following, which helped
get in the neighborhood of the problem. Needed to filter the log on the tablet.
```
2025-07-02T00:29:17,968 [compactions.CompactionService] TRACE: Did not
submit compaction plan 4;r16d06;r13e11 id:default files:Files
[allFiles=[[C00009ux.rf, 33437 252441], [A00009dw.rf, 3638005 7727960],
[C00009uz.rf, 25527 168294], [I00009m2.rf, 25286 0],
[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0],
[C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948
204357], [I00009p7.rf, 18369 0]], candidates=[[I00009m6.rf, 18179 0],
[I00009lv.rf, 21772 0], [A00009dw.rf, 3638005 7727960],
[C00009dz.rf, 18750 132363], [I00009ld.rf,
18232 0], [C00009qf.rf, 27948 204357], [I00009m2.rf, 25286 0]], compacting=[],
hints={}] plan:jobs: [CompactionJob [priority=18, executor=e.small,
files=[[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0],
[A00009dw.rf, 3638005 7727960],
[C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948
204357], [I00009m2.rf, 25286 0]], kind=USER]]
```
Took a heap dump of the tablet server and ran the following OQL query to
find the tablet object in the heap dump. Got the dir name from the metadata
table. Used the dirName field because it's a string. Would have liked to use
the extent to find the tablet, but that is a byte array and I did not know how
to query that in OQL. Would be useful to figure that out.
```
select t from org.apache.accumulo.tserver.tablet.Tablet t where
t.dirName.toString()=="t-00009q0"
```
After finding the tablet in the heap dump, was able to find an ExternalJob
that indirectly referenced the tablet. This ExternalJob had a state of running
and a null ecid. Also, the ExternalJob was only referenced by
CompactionService.submittedJob. This was the information that led to this bug
fix.
Not 100% sure this change will fix the problem, but it seems like it will.
Also not sure how to test this fix, as it's a race condition. May try to see if
the problem is reproducible w/o this fix.