[
https://issues.apache.org/jira/browse/ACCUMULO-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Newton updated ACCUMULO-436:
---------------------------------
Description:
After 14 hours of randomwalk, a merge operation appeared to be stuck.
The garbage collector was stuck and some tablets were offline:
||\# Online Tablet Servers||\# Total Tablet Servers||Loggers||Last GC||\# Tablets||\# Unassigned Tablets||Entries||Ingest||Query||Hold Time||OS Load||
|10|10|10|*Running 2/29/12 12:14 PM*|299|*4*|277.50M|311|5.53K|—|0.50|
The garbage collector could not get a consistent !METADATA table scan:
{noformat}
29 13:04:10,808 [util.TabletIterator] INFO : Resetting !METADATA scanner to [24q;5f83b8f927c41c9d%00; : [] 9223372036854775807 false,~ : [] 9223372036854775807 false)
29 13:04:11,071 [util.TabletIterator] INFO : Metadata inconsistency : 1419e44259517c51 != 5f83b8f927c41c9d metadataKey = 24q< ~tab:~pr [] 724883 false
{noformat}
Table (id 24q) had a merge in progress:
{noformat}
./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 7bea12fa46c40a72 status: IN_PROGRESS op: BulkImport
locked: [] locking: [R:24q] top: BulkImport
txid: 08db6105a25c0788 status: IN_PROGRESS op: CloneTable
locked: [] locking: [R:24q] top: CloneTable
txid: 5f798db1cab5fdea status: IN_PROGRESS op: BulkImport
locked: [] locking: [R:24q] top: BulkImport
txid: 6aa9a8a9b36a4f4d status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:24q] top: TableRangeOp
txid: 5c6e82e235ec3855 status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:24q] top: TableRangeOp
txid: 653a9293ba9f1cdc status: IN_PROGRESS op: RenameTable
locked: [] locking: [W:24q] top: RenameTable
txid: 651c62eb37136b6e status: IN_PROGRESS op: TableRangeOp
locked: [W:24q] locking: [] top: TableRangeOpWait
{noformat}
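The dump above shows one TableRangeOp (the merge, txid 651c62eb37136b6e) holding the write lock on table 24q ("locked: [W:24q]") while every other operation is still queued behind it ("locking:" rather than "locked:"). A minimal sketch of that queued reader/writer behavior, as an invented model rather than Accumulo's actual FATE code:

```python
# Toy model (not Accumulo's FATE implementation) of a queued table lock:
# a writer at the front of the queue holds the lock alone, a run of
# readers at the front share it, and everything else waits.
from collections import deque

class TableLock:
    def __init__(self):
        self.queue = deque()  # (txid, mode) in arrival order

    def request(self, txid, mode):
        self.queue.append((txid, mode))

    def holders(self):
        """Return the txids currently holding the lock."""
        held = []
        for txid, mode in self.queue:
            if mode == 'W':
                return held or [txid]  # a writer holds only if it is first
            held.append(txid)          # leading readers share the lock
        return held

lock = TableLock()
lock.request('651c62eb37136b6e', 'W')  # the stuck merge
lock.request('7bea12fa46c40a72', 'R')  # BulkImport, queued
lock.request('6aa9a8a9b36a4f4d', 'W')  # another TableRangeOp, queued
print(lock.holders())  # only the merge txid holds the lock
```

Until the merge transaction releases W:24q, the queued bulk imports, clone, rename, and range operations can never acquire their locks, which matches the pile-up in the dump.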
Scan of table 24q:
{noformat}
scan -b 24q; -e 24q<
24q;073b220b74a75059 loc:135396fb191d4b6 [] 192.168.117.6:9997
24q;073b220b74a75059 srv:compact [] 3
24q;073b220b74a75059 srv:dir [] /t-00031y0
24q;073b220b74a75059 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;073b220b74a75059 srv:time [] M0
24q;073b220b74a75059 ~tab:~pr [] \x00
24q;1419e44259517c51 loc:235396fb184b5cd [] 192.168.117.12:9997
24q;1419e44259517c51 srv:compact [] 3
24q;1419e44259517c51 srv:dir [] /t-00031y1
24q;1419e44259517c51 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;1419e44259517c51 srv:time [] M0
24q;1419e44259517c51 ~tab:~pr [] \x01073b220b74a75059
24q;51fc3e7faea2b7e9 chopped:chopped [] chopped
24q;51fc3e7faea2b7e9 srv:compact [] 3
24q;51fc3e7faea2b7e9 srv:dir [] /t-00031y2
24q;51fc3e7faea2b7e9 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;51fc3e7faea2b7e9 srv:time [] M0
24q;51fc3e7faea2b7e9 ~tab:~pr [] \x011419e44259517c51
24q;5e65b844f2c7f868 chopped:chopped [] chopped
24q;5e65b844f2c7f868 srv:compact [] 3
24q;5e65b844f2c7f868 srv:dir [] /t-00031e1
24q;5e65b844f2c7f868 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;5e65b844f2c7f868 srv:time [] M0
24q;5e65b844f2c7f868 ~tab:~pr [] \x0151fc3e7faea2b7e9
24q;5f83b8f927c41c9d chopped:chopped [] chopped
24q;5f83b8f927c41c9d srv:compact [] 3
24q;5f83b8f927c41c9d srv:dir [] /t-000329w
24q;5f83b8f927c41c9d srv:lock [] tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q;5f83b8f927c41c9d srv:time [] M0
24q;5f83b8f927c41c9d ~tab:~pr [] \x015e65b844f2c7f868
24q< chopped:chopped [] chopped
24q< srv:compact [] 3
24q< srv:dir [] /default_tablet
24q< srv:lock [] tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q< srv:time [] M0
24q< ~tab:~pr [] \x011419e44259517c51
{noformat}
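Each tablet's ~tab:~pr entry should name the end row of the tablet immediately before it, forming a chain across the table. In the scan above, 24q< still points back at 1419e44259517c51 even though 5f83b8f927c41c9d now precedes it, which is exactly the mismatch the garbage collector logged. A rough sketch of the kind of prev-row walk TabletIterator performs, using the values from the scan (hypothetical code, not Accumulo's implementation):

```python
# Prev-row chain from the scan above: end row -> ~tab:~pr value
# (None marks the first tablet; '<' is the default/last tablet).
tablets = {
    '073b220b74a75059': None,
    '1419e44259517c51': '073b220b74a75059',
    '51fc3e7faea2b7e9': '1419e44259517c51',
    '5e65b844f2c7f868': '51fc3e7faea2b7e9',
    '5f83b8f927c41c9d': '5e65b844f2c7f868',
    '<':                '1419e44259517c51',  # stale: should be 5f83b8f927c41c9d
}

def check_chain(tablets):
    """Walk tablets in row order and report prev-row mismatches."""
    errors = []
    prev = None
    for end in sorted(k for k in tablets if k != '<') + ['<']:
        if tablets[end] != prev:
            errors.append(f"{tablets[end]} != {prev} at {end}")
        prev = end
    return errors

print(check_chain(tablets))  # one mismatch, at the default tablet
```

Running this reproduces the "1419e44259517c51 != 5f83b8f927c41c9d" inconsistency from the garbage collector's log: the merge deleted the middle tablets but never updated the default tablet's prev-row entry.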
Master logs:
{noformat}
29 13:11:49,903 [state.MergeStats] INFO : Computing next merge state for 24q;6badf28df1d8ece7;37f3488aa92ac056 which is presently MERGING isDelete : false
29 13:11:49,903 [state.MergeStats] INFO : 4 tablets are unassigned 24q;6badf28df1d8ece7;37f3488aa92ac056
{noformat}
The final consistency check is failing because the merge is partially complete. The final step is not idempotent enough: partial execution leaves the Repo in a state from which it cannot continue after restart.
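An idempotent finish step would re-derive the remaining work from the current metadata state instead of assuming a clean starting point, so a restart after partial completion simply continues. A toy sketch of the idea, with invented names rather than Accumulo's actual Repo API:

```python
# Hypothetical sketch (invented names, not Accumulo's Repo code): the
# finish step deletes merged-away tablet entries, and is safe to re-run
# because already-deleted tablets are simply absent on the second pass.
def finish_merge(metadata, merge_range):
    """Remove all tablets in merge_range except the surviving last one."""
    for tablet in list(metadata):
        if tablet in merge_range and tablet != merge_range[-1]:
            metadata.discard(tablet)  # no-op if a prior partial run
                                      # already removed this entry

state = {'a', 'b', 'c', 'd'}          # tablets 'b' and 'c' merge into 'd'
finish_merge(state, ['b', 'c', 'd'])  # first run (possibly interrupted)
finish_merge(state, ['b', 'c', 'd'])  # rerun after restart: harmless
print(sorted(state))
```

A non-idempotent version that, say, asserted each tablet still existed before deleting it would fail on the rerun, which is the shape of the hang observed here.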
was:
After 14 hours of randomwalk, a merge operation appeared to be stuck.
Garbage collector was stuck, some tablets were offline:
||\# Online Tablet Servers||\# Total Tablet Servers||Loggers||Last GC||\# Tablets||\# Unassigned Tablets||Entries||Ingest||Query||Hold Time||OS Load||
|10|10|10|*Running 2/29/12 12:14 PM*|299|*4*|277.50M|311|5.53K|—|0.50|
Garbage collector could not get a consistent !METADATA table scan:
{noformat}
29 13:04:10,808 [util.TabletIterator] INFO : Resetting !METADATA scanner to [24q;5f83b8f927c41c9d%00; : [] 9223372036854775807 false,~ : [] 9223372036854775807 false)
29 13:04:11,071 [util.TabletIterator] INFO : Metadata inconsistency : 1419e44259517c51 != 5f83b8f927c41c9d metadataKey = 24q< ~tab:~pr [] 724883 false
{noformat}
Table (id 24q) had a merge in progress:
{noformat}
./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 7bea12fa46c40a72 status: IN_PROGRESS op: BulkImport
locked: [] locking: [R:24q] top: BulkImport
txid: 08db6105a25c0788 status: IN_PROGRESS op: CloneTable
locked: [] locking: [R:24q] top: CloneTable
txid: 5f798db1cab5fdea status: IN_PROGRESS op: BulkImport
locked: [] locking: [R:24q] top: BulkImport
txid: 6aa9a8a9b36a4f4d status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:24q] top: TableRangeOp
txid: 5c6e82e235ec3855 status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:24q] top: TableRangeOp
txid: 653a9293ba9f1cdc status: IN_PROGRESS op: RenameTable
locked: [] locking: [W:24q] top: RenameTable
txid: 651c62eb37136b6e status: IN_PROGRESS op: TableRangeOp
locked: [W:24q] locking: [] top: TableRangeOpWait
{noformat}
Scan of table 24q:
{noformat}
scan -b 24q; -e 24q<
24q;073b220b74a75059 loc:135396fb191d4b6 [] 192.168.117.6:9997
24q;073b220b74a75059 srv:compact [] 3
24q;073b220b74a75059 srv:dir [] /t-00031y0
24q;073b220b74a75059 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;073b220b74a75059 srv:time [] M0
24q;073b220b74a75059 ~tab:~pr [] \x00
24q;1419e44259517c51 loc:235396fb184b5cd [] 192.168.117.12:9997
24q;1419e44259517c51 srv:compact [] 3
24q;1419e44259517c51 srv:dir [] /t-00031y1
24q;1419e44259517c51 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;1419e44259517c51 srv:time [] M0
24q;1419e44259517c51 ~tab:~pr [] \x01073b220b74a75059
24q;51fc3e7faea2b7e9 chopped:chopped [] chopped
24q;51fc3e7faea2b7e9 srv:compact [] 3
24q;51fc3e7faea2b7e9 srv:dir [] /t-00031y2
24q;51fc3e7faea2b7e9 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;51fc3e7faea2b7e9 srv:time [] M0
24q;51fc3e7faea2b7e9 ~tab:~pr [] \x011419e44259517c51
24q;5e65b844f2c7f868 chopped:chopped [] chopped
24q;5e65b844f2c7f868 srv:compact [] 3
24q;5e65b844f2c7f868 srv:dir [] /t-00031e1
24q;5e65b844f2c7f868 srv:lock [] tservers/192.168.117.7:9997/zlock-0000000002$3353986642ea7f3
24q;5e65b844f2c7f868 srv:time [] M0
24q;5e65b844f2c7f868 ~tab:~pr [] \x0151fc3e7faea2b7e9
24q;5f83b8f927c41c9d chopped:chopped [] chopped
24q;5f83b8f927c41c9d srv:compact [] 3
24q;5f83b8f927c41c9d srv:dir [] /t-000329w
24q;5f83b8f927c41c9d srv:lock [] tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q;5f83b8f927c41c9d srv:time [] M0
24q;5f83b8f927c41c9d ~tab:~pr [] \x015e65b844f2c7f868
24q< chopped:chopped [] chopped
24q< srv:compact [] 3
24q< srv:dir [] /default_tablet
24q< srv:lock [] tservers/192.168.117.6:9997/zlock-0000000002$135396fb191c4f3
24q< srv:time [] M0
24q< ~tab:~pr [] \x011419e44259517c51
{noformat}
Master Logs
{noformat}
29 13:11:49,903 [state.MergeStats] INFO : Computing next merge state for 24q;6badf28df1d8ece7;37f3488aa92ac056 which is presently MERGING isDelete : false
29 13:11:49,903 [state.MergeStats] INFO : 4 tablets are unassigned 24q;6badf28df1d8ece7;37f3488aa92ac056
{noformat}
The final consistency check is failing because the delete is partially complete. The delete step is not idempotent enough: partial execution leaves the Repo in a state from which it cannot continue after restart.
> tablet merge stuck
> ------------------
>
> Key: ACCUMULO-436
> URL: https://issues.apache.org/jira/browse/ACCUMULO-436
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Environment: randomwalk with agitation on 10-node test cluster
> Reporter: Eric Newton
> Assignee: Eric Newton
> Fix For: 1.4.0
>
>
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira