merge failed to complete
------------------------
Key: ACCUMULO-356
URL: https://issues.apache.org/jira/browse/ACCUMULO-356
Project: Accumulo
Issue Type: Bug
Components: master
Affects Versions: 1.4.0
Environment: random walk test of "All" without "Security" on a 10-node
test cluster
Reporter: Eric Newton
Assignee: Eric Newton
Noticed the random walk test was hung; there was a tablet offline.
Listed the Fate operations:
{{{
$ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 59c0403614dc0c39 status: IN_PROGRESS op: RenameTable
locked: [] locking: [W:cz] top: RenameTable
txid: 37539f8d61548764 status: IN_PROGRESS op: ChangeTableState
locked: [] locking: [W:cz] top: ChangeTableState
txid: 02f8323a3136e60d status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:cz] top: TableRangeOp
txid: 044015732e97eec1 status: IN_PROGRESS op: CompactRange
locked: [] locking: [R:cz] top: CompactRange
txid: 6ce9dd63f9d51448 status: IN_PROGRESS op: CompactRange
locked: [] locking: [R:cz] top: CompactRange
txid: 417cb9b60e44ecd9 status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:cz] top: TableRangeOp
txid: 5e7c5284a4677d6c status: IN_PROGRESS op: DeleteTable
locked: [] locking: [W:cz] top: DeleteTable
txid: 6633d3d841d66995 status: IN_PROGRESS op: TableRangeOp
locked: [W:cz] locking: [] top: TableRangeOpWait
}}}
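Reading this output: "locked" lists the locks a transaction holds and
"locking" lists the locks it is waiting on. So 6633d3d841d66995 is the only
transaction actually holding the write lock on table cz (W:cz); the other
seven are all queued behind it.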
A quick check of the master logs shows that Fate tx 6633d3d841d66995 did not
complete. Here's what the master saw for this transaction. The transaction is
seeded, and a write lock is put into the queue, but there are other operations
in flight:
{{{
30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 4
userData 6633d3d841d66995 lockType WRITE
}}}
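For context: that write lock is just an entry in the FIFO queue that
DistributedReadWriteLock keeps in ZooKeeper, so entry 4 can't be granted until
the entries ahead of it drain. A minimal in-memory sketch of the granting rule
(hypothetical class, not the real implementation, which stores the entries as
sequential ZooKeeper nodes):
{{{
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory sketch of the FIFO lock queue; entry numbers mirror
// "Added lock entry 4 ... lockType WRITE" in the log above.
class RwLockQueueSketch {
    enum LockType { READ, WRITE }

    // entry sequence number -> lock type, in arrival order
    private final SortedMap<Integer, LockType> queue = new TreeMap<>();

    void addEntry(int seq, LockType type) {
        queue.put(seq, type);
    }

    boolean isGranted(int seq) {
        LockType type = queue.get(seq);
        if (type == null)
            return false;
        SortedMap<Integer, LockType> ahead = queue.headMap(seq);
        if (type == LockType.WRITE)
            return ahead.isEmpty(); // a writer must be first in line
        // a reader only needs no writer ahead of it, so reads can share
        return !ahead.containsValue(LockType.WRITE);
    }
}
}}}
With that rule, our write entry sits behind the read and write entries shown
next.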
For example, a read lock held by transaction 56fada3359bfb4c7 was released:
{{{
30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked
for read
}}}
And there was another write lock ahead of ours, for transaction 62c3cecaf387398e:
{{{
30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked for
write operation: ONLINE
30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz
from OFFLINE to ONLINE
30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper
changed for cz to ONLINE
30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked
for write
}}}
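The NodeDataChanged line in that snippet is an ordinary ZooKeeper data watch on
the table's state node. A rough sketch of the watch mechanics, assuming a
hypothetical listener callback (this is not TableManager's actual code):
{{{
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

// Sketch of how a table-state watcher sees the OFFLINE -> ONLINE flip.
// Only the watch mechanics on /accumulo/<instance>/tables/<id>/state
// are the point here.
class TableStateWatcherSketch implements Watcher {
    interface Listener {
        void stateChanged(String newState);
    }

    private final ZooKeeper zk;
    private final String statePath;
    private final Listener listener;

    TableStateWatcherSketch(ZooKeeper zk, String statePath, Listener listener) {
        this.zk = zk;
        this.statePath = statePath;
        this.listener = listener;
    }

    // Read the current state and leave a one-shot watch behind; ZooKeeper
    // watches fire once, so every event handler must re-register.
    String readStateAndWatch() throws Exception {
        byte[] data = zk.getData(statePath, this, null);
        return new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDataChanged) {
            try {
                // this is where the master logs "State transition to ONLINE @ ..."
                listener.stateChanged(readStateAndWatch());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
}}}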
So now our table is in the online state... but the tablets may not be (yet).
Our Fate operation gets the write lock and starts to execute:
{{{
30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz ONLINE
30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to ONLINE
30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked for
write operation: MERGE
}}}
Hmm... this seems strange: we went from STARTED directly to MERGING, with no
OFFLINE and no CHOPPING in between. Those are very strange MergeStats, too. I'm
not 100% sure that these merge stats go with this operation, but the timing
suggests they do (need to fix that log message).
{{{
30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of
cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of
cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
}}}
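For reference, here is the progression I'd expect a merge to take, as a
hypothetical enum mirroring the states named above (a guess at the shape, not
the actual server MergeState class):
{{{
// Hypothetical sketch of the merge state progression; just the order
// the logs above should have shown.
enum MergeStateSketch {
    STARTED,             // merge request recorded for the range
    WAITING_FOR_CHOPPED, // every tablet in the range must be chopped first
    WAITING_FOR_OFFLINE, // every tablet must then be unassigned
    MERGING,             // metadata rows are rewritten
    COMPLETE;

    boolean isLegalNext(MergeStateSketch next) {
        // normal progress is one state at a time; the logs above show
        // STARTED jumping straight to MERGING, skipping both waits
        return next.ordinal() == this.ordinal() + 1;
    }
}
}}}
The 0-of-0 totals also suggest that whatever scan feeds MergeStats saw no
tablets at all at that instant.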
Looks like the tablets were offline, too, because the main master loop is
trying to assign them:
{{{
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
30 20:52:29,615 [master.EventCoordinator] INFO : tablet
cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
30 20:52:29,616 [master.EventCoordinator] INFO : tablet
cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
}}}
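Those assignment lines come from the master's normal assignment pass, which
pairs unassigned tablets with live tablet servers. A toy round-robin version
just to show the shape of the loop (the real master consults the balancer and
the tablets' last locations):
{{{
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy assignment pass: every unassigned tablet extent gets matched
// with a live tablet server, round-robin.
class AssignmentSketch {
    static Map<String, String> assign(List<String> unassignedExtents,
            List<String> liveTservers) {
        Map<String, String> assignments = new LinkedHashMap<>();
        int i = 0;
        for (String extent : unassignedExtents) {
            // e.g. "cz;08498ad358a45bb0;050754dd3cc35c16" -> "192.168.117.4:9997"
            assignments.put(extent, liveTservers.get(i++ % liveTservers.size()));
        }
        return assignments;
    }
}
}}}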
And, just to make this more interesting, a tablet splits shortly after being
loaded:
{{{
30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9
was unloaded
30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported
split cz;08498ad358a45bb0;050754dd3cc35c16, cz;137d382d52b8fbd9;08498ad358a45bb0
}}}
The master takes some of the tablets offline:
{{{
30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
was unloaded
30 20:52:30,083 [master.EventCoordinator] INFO : tablet
cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
30 20:52:30,084 [master.EventCoordinator] INFO : tablet
cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
}}}
But not all of them:
{{{
30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
was loaded
}}}
The !METADATA status:
{{{
!METADATA> scan -b cz; -e d
cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf [] 50848,0
cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf [] 2936,660
cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf [] 3176,720
cz;050754dd3cc35c16 last:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 loc:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 srv:dir [] /t-00006p2
cz;050754dd3cc35c16 srv:flush [] 0
cz;050754dd3cc35c16 srv:lock []
tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
cz;050754dd3cc35c16 srv:time [] M1327993277526
cz;050754dd3cc35c16 ~tab:~pr [] \x00
cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf [] 47216,0
cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf [] 1883,400
cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf [] 1717,360
cz;08498ad358a45bb0 last:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 loc:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 srv:dir [] /t-00006p3
cz;08498ad358a45bb0 srv:flush [] 0
cz;08498ad358a45bb0 srv:lock []
tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
cz;08498ad358a45bb0 srv:time [] M1327993277506
cz;08498ad358a45bb0 ~tab:~pr [] \x01050754dd3cc35c16
cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf [] 118041,0
cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf [] 6017,1430
cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf [] 5409,1270
cz;137d382d52b8fbd9 last:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 loc:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 srv:dir [] /t-00006os
cz;137d382d52b8fbd9 srv:flush [] 0
cz;137d382d52b8fbd9 srv:lock []
tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
cz;137d382d52b8fbd9 srv:time [] M1327993277515
cz;137d382d52b8fbd9 ~tab:~pr [] \x0108498ad358a45bb0
cz< chopped:chopped [] chopped
cz< file:/b-00006eb/I00006ec.rf [] 1210190,0
cz< srv:compact [] 1
cz< srv:dir [] /default_tablet
cz< srv:lock [] tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
cz< srv:time [] M0
cz< ~tab:~pr [] \x01137d382d52b8fbd9
}}}
Four tablets with consistent splits; three are online, and the offline tablet
is chopped. The offline tablet doesn't have a last location, which is strange.
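The "consistent splits" claim can be checked mechanically from the ~tab:~pr
column, which encodes the previous tablet's end row (a leading \x00 for null,
a leading \x01 followed by the row, as the scan output suggests). A small
sketch of that check using the rows above (hypothetical helper, not part of
the shell):
{{{
import java.util.LinkedHashMap;
import java.util.Map;

// Verifies that each tablet's ~tab:~pr value matches the end row of
// the tablet before it, i.e. no holes and no overlaps in the splits.
class SplitChainCheckSketch {
    static String decodePrevRow(String value) {
        // \x00 means a null prev row; \x01 prefixes the actual row bytes
        return value.charAt(0) == '\000' ? null : value.substring(1);
    }

    public static void main(String[] args) {
        // end row -> raw ~tab:~pr value, in sorted order, from the scan
        // above (cz< is the default tablet, whose end row is null)
        Map<String, String> tablets = new LinkedHashMap<>();
        tablets.put("050754dd3cc35c16", "\000");
        tablets.put("08498ad358a45bb0", "\001050754dd3cc35c16");
        tablets.put("137d382d52b8fbd9", "\00108498ad358a45bb0");
        tablets.put(null, "\001137d382d52b8fbd9");

        String prevEndRow = null; // the first tablet must have a null prev row
        for (Map.Entry<String, String> e : tablets.entrySet()) {
            String decoded = decodePrevRow(e.getValue());
            if (decoded == null ? prevEndRow != null : !decoded.equals(prevEndRow))
                throw new IllegalStateException("hole or overlap at " + e.getKey());
            prevEndRow = e.getKey();
        }
        System.out.println("splits are consistent");
    }
}
}}}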