[
https://issues.apache.org/jira/browse/ACCUMULO-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Turner updated ACCUMULO-356:
----------------------------------
Affects Version/s: (was: 1.4.0)
Fix Version/s: 1.4.0
> merge failed to complete
> ------------------------
>
> Key: ACCUMULO-356
> URL: https://issues.apache.org/jira/browse/ACCUMULO-356
> Project: Accumulo
> Issue Type: Bug
> Components: master
> Environment: random walk test of "All" without "Security" on a
> 10-node test cluster
> Reporter: Eric Newton
> Assignee: Eric Newton
> Fix For: 1.4.0
>
>
> Noticed the random walk test was generally hung. There was a tablet offline.
> Listed the Fate operations:
> {noformat}
> $ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
> txid: 59c0403614dc0c39 status: IN_PROGRESS op: RenameTable
> locked: [] locking: [W:cz] top: RenameTable
> txid: 37539f8d61548764 status: IN_PROGRESS op: ChangeTableState
> locked: [] locking: [W:cz] top: ChangeTableState
> txid: 02f8323a3136e60d status: IN_PROGRESS op: TableRangeOp
> locked: [] locking: [W:cz] top: TableRangeOp
> txid: 044015732e97eec1 status: IN_PROGRESS op: CompactRange
> locked: [] locking: [R:cz] top: CompactRange
> txid: 6ce9dd63f9d51448 status: IN_PROGRESS op: CompactRange
> locked: [] locking: [R:cz] top: CompactRange
> txid: 417cb9b60e44ecd9 status: IN_PROGRESS op: TableRangeOp
> locked: [] locking: [W:cz] top: TableRangeOp
> txid: 5e7c5284a4677d6c status: IN_PROGRESS op: DeleteTable
> locked: [] locking: [W:cz] top: DeleteTable
> txid: 6633d3d841d66995 status: IN_PROGRESS op: TableRangeOp
> locked: [W:cz] locking: [] top: TableRangeOpWait
> {noformat}
> Quick check of master logs finds that Fate tx 6633d3d841d66995 did not
> complete. Here's what the master saw for this transaction:
> The transaction is seeded, and a write lock is put into the queue; but there
> are other operations going on
> {noformat}
> 30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry
> 4 userData 6633d3d841d66995 lockType WRITE
> {noformat}
> For example, there was a read lock for transaction 56fada3359bfb4c7
> {noformat}
> 30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked
> for read
> {noformat}
> And yet another write lock ahead of us for transaction 62c3cecaf387398e
> {noformat}
> 30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked
> for write operation: ONLINE
> 30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz
> from OFFLINE to ONLINE
> 30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @
> WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
> 30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper
> changed for cz to ONLINE
> 30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked
> for write
> {noformat}
> So now our table is in the online state... but the tablets may not be (yet).
> Our Fate operation gets the write lock and starts to execute
> {noformat}
> 30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz
> ONLINE
> 30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to
> ONLINE
> 30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked
> for write operation: MERGE
> {noformat}
> Hmm... this seems strange... we went from STARTED directly to MERGING: no
> OFFLINE, no CHOPPING. Those are very strange MergeStats, too. I'm not 100%
> sure that these merge stats go with this operation, but the timing suggests
> that it does (need to fix that log message).
> {noformat}
> 30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of
> cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
> 30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
> 30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
> 30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of
> cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
> {noformat}
> Looks like the tablets were offline, too, because the main master loop is
> trying to assign them:
> {noformat}
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
> cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
> cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
> 30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
> cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
> 30 20:52:29,615 [master.EventCoordinator] INFO : tablet
> cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
> 30 20:52:29,616 [master.EventCoordinator] INFO : tablet
> cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
> {noformat}
> And, just to make this more interesting, a tablet splits shortly after being
> loaded:
> {noformat}
> 30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
> 30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9
> was unloaded
> 30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported
> split cz;08498ad358a45bb0;050754dd3cc35c16,
> cz;137d382d52b8fbd9;08498ad358a45bb0
> {noformat}
> Master takes some of the tablets offline
> {noformat}
> 30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
> was unloaded
> 30 20:52:30,083 [master.EventCoordinator] INFO : tablet
> cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
> 30 20:52:30,084 [master.EventCoordinator] INFO : tablet
> cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
> {noformat}
> But not all of them
> {noformat}
> 30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
> was loaded
> {noformat}
> The !METADATA status
> {noformat}
> !METADATA> scan -b cz; -e d
> cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf [] 50848,0
> cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf [] 2936,660
> cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf [] 3176,720
> cz;050754dd3cc35c16 last:334c843791cb78b [] 192.168.117.5:9997
> cz;050754dd3cc35c16 loc:334c843791cb78b [] 192.168.117.5:9997
> cz;050754dd3cc35c16 srv:dir [] /t-00006p2
> cz;050754dd3cc35c16 srv:flush [] 0
> cz;050754dd3cc35c16 srv:lock []
> tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
> cz;050754dd3cc35c16 srv:time [] M1327993277526
> cz;050754dd3cc35c16 ~tab:~pr [] \x00
> cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf [] 47216,0
> cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf [] 1883,400
> cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf [] 1717,360
> cz;08498ad358a45bb0 last:134d7425fc8959e [] 192.168.117.4:9997
> cz;08498ad358a45bb0 loc:134d7425fc8959e [] 192.168.117.4:9997
> cz;08498ad358a45bb0 srv:dir [] /t-00006p3
> cz;08498ad358a45bb0 srv:flush [] 0
> cz;08498ad358a45bb0 srv:lock []
> tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
> cz;08498ad358a45bb0 srv:time [] M1327993277506
> cz;08498ad358a45bb0 ~tab:~pr [] \x01050754dd3cc35c16
> cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf [] 118041,0
> cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf [] 6017,1430
> cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf [] 5409,1270
> cz;137d382d52b8fbd9 last:134d7425fc8959f [] 192.168.117.6:9997
> cz;137d382d52b8fbd9 loc:134d7425fc8959f [] 192.168.117.6:9997
> cz;137d382d52b8fbd9 srv:dir [] /t-00006os
> cz;137d382d52b8fbd9 srv:flush [] 0
> cz;137d382d52b8fbd9 srv:lock []
> tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
> cz;137d382d52b8fbd9 srv:time [] M1327993277515
> cz;137d382d52b8fbd9 ~tab:~pr [] \x0108498ad358a45bb0
> cz< chopped:chopped [] chopped
> cz< file:/b-00006eb/I00006ec.rf [] 1210190,0
> cz< srv:compact [] 1
> cz< srv:dir [] /default_tablet
> cz< srv:lock []
> tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
> cz< srv:time [] M0
> cz< ~tab:~pr [] \x01137d382d52b8fbd9
> {noformat}
> 4 tablets, consistent splits, 3 are online, the offline tablet is chopped.
> The offline tablet doesn't have a last location, which is strange.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira