merge failed to complete
------------------------
Key: ACCUMULO-356
URL: https://issues.apache.org/jira/browse/ACCUMULO-356
Project: Accumulo
Issue Type: Bug
Components: master
Affects Versions: 1.4.0
Environment: random walk test of "All" without "Security" on a 10-node
test cluster
Reporter: Eric Newton
Assignee: Eric Newton
Noticed the random walk test was hung; there was a tablet offline.
Listed the Fate operations:
{{{
$ ./bin/accumulo org.apache.accumulo.server.fate.Admin print
txid: 59c0403614dc0c39 status: IN_PROGRESS op: RenameTable
locked: [] locking: [W:cz] top: RenameTable
txid: 37539f8d61548764 status: IN_PROGRESS op: ChangeTableState
locked: [] locking: [W:cz] top: ChangeTableState
txid: 02f8323a3136e60d status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:cz] top: TableRangeOp
txid: 044015732e97eec1 status: IN_PROGRESS op: CompactRange
locked: [] locking: [R:cz] top: CompactRange
txid: 6ce9dd63f9d51448 status: IN_PROGRESS op: CompactRange
locked: [] locking: [R:cz] top: CompactRange
txid: 417cb9b60e44ecd9 status: IN_PROGRESS op: TableRangeOp
locked: [] locking: [W:cz] top: TableRangeOp
txid: 5e7c5284a4677d6c status: IN_PROGRESS op: DeleteTable
locked: [] locking: [W:cz] top: DeleteTable
txid: 6633d3d841d66995 status: IN_PROGRESS op: TableRangeOp
locked: [W:cz] locking: [] top: TableRangeOpWait
}}}
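Reading this output: "locked" lists the locks a transaction holds and
"locking" lists the locks it is waiting on. So 6633d3d841d66995 is the only
transaction actually holding the write lock on table cz (W:cz); the other
seven are all queued behind it.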
A quick check of the master logs shows that Fate tx 6633d3d841d66995 did not
complete. Here's what the master saw for this transaction. The transaction is
seeded, and a write lock is put into the queue, but there are other operations
in flight:
{{{
30 20:52:29,387 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 4
userData 6633d3d841d66995 lockType WRITE
}}}
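For context: that write lock is just an entry in the FIFO queue that
DistributedReadWriteLock keeps in ZooKeeper, so entry 4 can't be granted until
the entries ahead of it drain. A minimal in-memory sketch of the granting rule
(hypothetical class, not the real implementation, which stores the entries as
sequential ZooKeeper nodes):
{{{
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory sketch of the FIFO lock queue; entry numbers mirror
// "Added lock entry 4 ... lockType WRITE" in the log above.
class RwLockQueueSketch {
    enum LockType { READ, WRITE }

    // entry sequence number -> lock type, in arrival order
    private final SortedMap<Integer, LockType> queue = new TreeMap<>();

    void addEntry(int seq, LockType type) {
        queue.put(seq, type);
    }

    boolean isGranted(int seq) {
        LockType type = queue.get(seq);
        if (type == null)
            return false;
        SortedMap<Integer, LockType> ahead = queue.headMap(seq);
        if (type == LockType.WRITE)
            return ahead.isEmpty(); // a writer must be first in line
        // a reader only needs no writer ahead of it, so reads can share
        return !ahead.containsValue(LockType.WRITE);
    }
}
}}}
With that rule, our write entry sits behind the read and write entries shown
next.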
For example, a read lock held by transaction 56fada3359bfb4c7 was released:
{{{
30 20:52:29,391 [tableOps.Utils] INFO : table cz (56fada3359bfb4c7) unlocked
for read
}}}
And there was another write lock ahead of ours, for transaction 62c3cecaf387398e:
{{{
30 20:52:29,486 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) locked for
write operation: ONLINE
30 20:52:29,488 [tables.TableManager] DEBUG: Transitioning state for table cz
from OFFLINE to ONLINE
30 20:52:29,491 [tables.TableManager] DEBUG: State transition to ONLINE @
WatchedEvent state:SyncConnected type:NodeDataChanged
path:/accumulo/88cd0f63-a36a-4218-86b1-9ba1d2cccf08/tables/cz/state
30 20:52:29,491 [master.EventCoordinator] INFO : Table state in zookeeper
changed for cz to ONLINE
30 20:52:29,494 [tableOps.Utils] INFO : table cz (62c3cecaf387398e) unlocked
for write
}}}
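The NodeDataChanged line in that snippet is an ordinary ZooKeeper data watch on
the table's state node. A rough sketch of the watch mechanics, assuming a
hypothetical listener callback (this is not TableManager's actual code):
{{{
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

// Sketch of how a table-state watcher sees the OFFLINE -> ONLINE flip.
// Only the watch mechanics on /accumulo/<instance>/tables/<id>/state
// are the point here.
class TableStateWatcherSketch implements Watcher {
    interface Listener {
        void stateChanged(String newState);
    }

    private final ZooKeeper zk;
    private final String statePath;
    private final Listener listener;

    TableStateWatcherSketch(ZooKeeper zk, String statePath, Listener listener) {
        this.zk = zk;
        this.statePath = statePath;
        this.listener = listener;
    }

    // Read the current state and leave a one-shot watch behind; ZooKeeper
    // watches fire once, so every event handler must re-register.
    String readStateAndWatch() throws Exception {
        byte[] data = zk.getData(statePath, this, null);
        return new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDataChanged) {
            try {
                // this is where the master logs "State transition to ONLINE @ ..."
                listener.stateChanged(readStateAndWatch());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}
}}}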
So now our table is in the online state... but the tablets may not be (yet).
Our Fate operation gets the write lock and starts to execute:
{{{
30 20:52:29,494 [tableOps.ChangeTableState] DEBUG: Changed table state cz ONLINE
30 20:52:29,494 [master.EventCoordinator] INFO : Set table state of cz to ONLINE
30 20:52:29,495 [tableOps.Utils] INFO : table cz (6633d3d841d66995) locked for
write operation: MERGE
}}}
Hmm... this seems strange: we went from STARTED directly to MERGING, with no
OFFLINE and no CHOPPING in between. Those are very strange MergeStats, too. I'm
not 100% sure that these merge stats go with this operation, but the timing
suggests they do (need to fix that log message).
{{{
30 20:52:29,500 [master.EventCoordinator] INFO : Merge state of
cz;7c9131203e34514c;39ceaf91dd1db6ef set to STARTED
30 20:52:29,516 [state.MergeStats] INFO : 0 are hosted, total 0
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are chopped, 0 are offline
30 20:52:29,517 [state.MergeStats] INFO : 0 tablets are unassigned
30 20:52:29,535 [master.EventCoordinator] INFO : Merge state of
cz;7c9131203e34514c;39ceaf91dd1db6ef set to MERGING
}}}
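For reference, here is the progression I'd expect a merge to take, as a
hypothetical enum mirroring the states named above (a guess at the shape, not
the actual server MergeState class):
{{{
// Hypothetical sketch of the merge state progression; just the order
// the logs above should have shown.
enum MergeStateSketch {
    STARTED,             // merge request recorded for the range
    WAITING_FOR_CHOPPED, // every tablet in the range must be chopped first
    WAITING_FOR_OFFLINE, // every tablet must then be unassigned
    MERGING,             // metadata rows are rewritten
    COMPLETE;

    boolean isLegalNext(MergeStateSketch next) {
        // normal progress is one state at a time; the logs above show
        // STARTED jumping straight to MERGING, skipping both waits
        return next.ordinal() == this.ordinal() + 1;
    }
}
}}}
The 0-of-0 totals also suggest that whatever scan feeds MergeStats saw no
tablets at all at that instant.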
Looks like the tablets were offline, too, because the main master loop is
trying to assign them:
{{{
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;08498ad358a45bb0;050754dd3cc35c16=192.168.117.4:9997[134d7425fc8959e]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;050754dd3cc35c16<=192.168.117.5:9997[334c843791cb78b]
30 20:52:29,598 [master.Master] DEBUG: Normal Tablets assigning tablet
cz;137d382d52b8fbd9;08498ad358a45bb0=192.168.117.6:9997[134d7425fc8959f]
30 20:52:29,615 [master.EventCoordinator] INFO : tablet
cz;08498ad358a45bb0;050754dd3cc35c16 was loaded
30 20:52:29,616 [master.EventCoordinator] INFO : tablet
cz;137d382d52b8fbd9;08498ad358a45bb0 was loaded
}}}
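Those assignment lines come from the master's normal assignment pass, which
pairs unassigned tablets with live tablet servers. A toy round-robin version
just to show the shape of the loop (the real master consults the balancer and
the tablets' last locations):
{{{
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy assignment pass: every unassigned tablet extent gets matched
// with a live tablet server, round-robin.
class AssignmentSketch {
    static Map<String, String> assign(List<String> unassignedExtents,
            List<String> liveTservers) {
        Map<String, String> assignments = new LinkedHashMap<>();
        int i = 0;
        for (String extent : unassignedExtents) {
            // e.g. "cz;08498ad358a45bb0;050754dd3cc35c16" -> "192.168.117.4:9997"
            assignments.put(extent, liveTservers.get(i++ % liveTservers.size()));
        }
        return assignments;
    }
}
}}}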
And, just to make this more interesting, a tablet splits shortly after being
loaded:
{{{
30 20:52:29,603 [state.MergeStats] INFO : 1 tablets are unassigned
30 20:52:29,688 [master.EventCoordinator] INFO : tablet cz<;137d382d52b8fbd9
was unloaded
30 20:52:30,081 [master.EventCoordinator] INFO : 192.168.117.9:9997 reported
split cz;08498ad358a45bb0;050754dd3cc35c16, cz;137d382d52b8fbd9;08498ad358a45bb0
}}}
The master takes some of the tablets offline:
{{{
30 20:52:30,083 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
was unloaded
30 20:52:30,083 [master.EventCoordinator] INFO : tablet
cz;08498ad358a45bb0;050754dd3cc35c16 was unloaded
30 20:52:30,084 [master.EventCoordinator] INFO : tablet
cz;137d382d52b8fbd9;08498ad358a45bb0 was unloaded
}}}
But not all of them:
{{{
30 20:52:30,445 [master.EventCoordinator] INFO : tablet cz;050754dd3cc35c16<
was loaded
}}}
The !METADATA status:
{{{
!METADATA> scan -b cz; -e d
cz;050754dd3cc35c16 file:/b-00006eb/I00006ec.rf [] 50848,0
cz;050754dd3cc35c16 file:/t-00006p2/F0000qra.rf [] 2936,660
cz;050754dd3cc35c16 file:/t-00006p2/F00023p9.rf [] 3176,720
cz;050754dd3cc35c16 last:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 loc:334c843791cb78b [] 192.168.117.5:9997
cz;050754dd3cc35c16 srv:dir [] /t-00006p2
cz;050754dd3cc35c16 srv:flush [] 0
cz;050754dd3cc35c16 srv:lock []
tservers/192.168.117.5:9997/zlock-0000000000$334c843791cb78b
cz;050754dd3cc35c16 srv:time [] M1327993277526
cz;050754dd3cc35c16 ~tab:~pr [] \x00
cz;08498ad358a45bb0 file:/b-00006eb/I00006ec.rf [] 47216,0
cz;08498ad358a45bb0 file:/t-00006p3/F0000qj9.rf [] 1883,400
cz;08498ad358a45bb0 file:/t-00006p3/F00022u1.rf [] 1717,360
cz;08498ad358a45bb0 last:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 loc:134d7425fc8959e [] 192.168.117.4:9997
cz;08498ad358a45bb0 srv:dir [] /t-00006p3
cz;08498ad358a45bb0 srv:flush [] 0
cz;08498ad358a45bb0 srv:lock []
tservers/192.168.117.4:9997/zlock-0000000000$134d7425fc8959e
cz;08498ad358a45bb0 srv:time [] M1327993277506
cz;08498ad358a45bb0 ~tab:~pr [] \x01050754dd3cc35c16
cz;137d382d52b8fbd9 file:/b-00006eb/I00006ec.rf [] 118041,0
cz;137d382d52b8fbd9 file:/t-00006os/F0000quc.rf [] 6017,1430
cz;137d382d52b8fbd9 file:/t-00006os/F00023gh.rf [] 5409,1270
cz;137d382d52b8fbd9 last:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 loc:134d7425fc8959f [] 192.168.117.6:9997
cz;137d382d52b8fbd9 srv:dir [] /t-00006os
cz;137d382d52b8fbd9 srv:flush [] 0
cz;137d382d52b8fbd9 srv:lock []
tservers/192.168.117.6:9997/zlock-0000000000$134d7425fc8959f
cz;137d382d52b8fbd9 srv:time [] M1327993277515
cz;137d382d52b8fbd9 ~tab:~pr [] \x0108498ad358a45bb0
cz< chopped:chopped [] chopped
cz< file:/b-00006eb/I00006ec.rf [] 1210190,0
cz< srv:compact [] 1
cz< srv:dir [] /default_tablet
cz< srv:lock [] tservers/192.168.117.10:9997/zlock-0000000000$334c843791cb78f
cz< srv:time [] M0
cz< ~tab:~pr [] \x01137d382d52b8fbd9
}}}
Four tablets with consistent splits; three are online, and the offline tablet
is chopped. The offline tablet doesn't have a last location, which is strange.
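The "consistent splits" claim can be checked mechanically from the ~tab:~pr
column, which encodes the previous tablet's end row (a leading \x00 for null,
a leading \x01 followed by the row, as the scan output suggests). A small
sketch of that check using the rows above (hypothetical helper, not part of
the shell):
{{{
import java.util.LinkedHashMap;
import java.util.Map;

// Verifies that each tablet's ~tab:~pr value matches the end row of
// the tablet before it, i.e. no holes and no overlaps in the splits.
class SplitChainCheckSketch {
    static String decodePrevRow(String value) {
        // \x00 means a null prev row; \x01 prefixes the actual row bytes
        return value.charAt(0) == '\000' ? null : value.substring(1);
    }

    public static void main(String[] args) {
        // end row -> raw ~tab:~pr value, in sorted order, from the scan
        // above (cz< is the default tablet, whose end row is null)
        Map<String, String> tablets = new LinkedHashMap<>();
        tablets.put("050754dd3cc35c16", "\000");
        tablets.put("08498ad358a45bb0", "\001050754dd3cc35c16");
        tablets.put("137d382d52b8fbd9", "\00108498ad358a45bb0");
        tablets.put(null, "\001137d382d52b8fbd9");

        String prevEndRow = null; // the first tablet must have a null prev row
        for (Map.Entry<String, String> e : tablets.entrySet()) {
            String decoded = decodePrevRow(e.getValue());
            if (decoded == null ? prevEndRow != null : !decoded.equals(prevEndRow))
                throw new IllegalStateException("hole or overlap at " + e.getKey());
            prevEndRow = e.getKey();
        }
        System.out.println("splits are consistent");
    }
}
}}}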