[ 
https://issues.apache.org/jira/browse/ACCUMULO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner resolved ACCUMULO-315.
-----------------------------------

    Resolution: Fixed
    
> Hole in metadata table occurred during random walk test
> -------------------------------------------------------
>
>                 Key: ACCUMULO-315
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-315
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>         Environment: Running 1.4.0 SNAPSHOT on 10 node cluster.
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> While running the random walk test, a hole in the metadata table occurred.  A 
> client tried to delete the table with the hole and the FATE op got stuck.  
> The master logs continually showed the following.
> {noformat}
> 14 00:02:11,273 [tableOps.CleanUp] DEBUG: Still waiting for table to be 
> deleted: 4ct locationState: 
> 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef@(null,xxx.xxx.xxx.xxx:9997[134d7425fc503e1],null)
> {noformat}
> The metadata table contained the following.  Tablet 4ct;4d2d3be2823b0bf4 had 
> a location.
> {noformat}
> 4ct;262249211a62cd6f ~tab:~pr []    \x011819e56edae21302
> 4ct;27b693c626c2d4ef ~tab:~pr []    \x01262249211a62cd6f
> 4ct;43422047c78fa52b ~tab:~pr []    \x0141ea825af0f262d9
> 4ct;4d2d3be2823b0bf4 ~tab:~pr []    \x0127b693c626c2d4ef
> 4ct;4f89df61392bb311 ~tab:~pr []    \x014d2d3be2823b0bf4
> {noformat}
> Found the following events on a tablet server.
> {noformat}
> #the tablet server events below are caused by the delete range operation
> 13 21:36:04,287 [tabletserver.Tablet] TABLET_HIST: 
> 4ct;4d2d3be2823b0bf4;262249211a62cd6f split 
> 4ct;27b693c626c2d4ef;262249211a62cd6f 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef
> 13 21:36:04,369 [tabletserver.Tablet] TABLET_HIST: 
> 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef split 
> 4ct;41ea825af0f262d9;27b693c626c2d4ef 4ct;4d2d3be2823b0bf4;41ea825af0f262d9
> 13 21:36:04,370 [tabletserver.Tablet] TABLET_HIST: 
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 opened
> 13 21:36:06,141 [tabletserver.Tablet] TABLET_HIST: 
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 closed
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for low split 
> 4ct;43422047c78fa52b;41ea825af0f262d9  [/t-0001cdi/F0001bmw.rf, 
> /t-0001cdi/F0001bn1.rf]
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for high split 
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b  [/t-0001cdi/A0001cef.rf, 
> /t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> #split from other random walker
> 13 21:36:06,351 [tabletserver.Tablet] TABLET_HIST: 
> 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 split 
> 4ct;43422047c78fa52b;41ea825af0f262d9 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> {noformat}
> The following events occurred on the master and overlap in time with the 
> split on the tablet server.
> {noformat}
> 13 21:36:06,312 [master.EventCoordinator] INFO : Merge state of 
> 4ct;41ea825af0f262d9;27b693c626c2d4ef set to MERGING
> 13 21:36:06,312 [master.Master] DEBUG: Deleting tablets for 
> 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,316 [master.Master] DEBUG: Found following tablet 
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> 13 21:36:06,317 [master.Master] DEBUG: Making file deletion entries for 
> 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,325 [master.Master] DEBUG: Removing metadata table entries in 
> range [4ct;27b693c626c2d4ef%00; : [] 9223372036854775807 
> false,4ct;41ea825af0f262d9%00; : [] 9223372036854775807 false)
> 13 21:36:06,331 [master.Master] DEBUG: Updating prevRow of 
> 4ct;4d2d3be2823b0bf4;43422047c78fa52b to 27b693c626c2d4ef
> {noformat}
> After many hours of debugging Eric and I figured out what was going on.  Two 
> random walkers were running the concurrent test.  One client initiated a 
> delete range on table id 4ct for the range 27b693c626c2d4ef to 
> 41ea825af0f262d9.  While this delete range operation was occurring, another 
> client added the split point 43422047c78fa52b.  The master read the metadata 
> table while the split was occurring and got inconsistent/incomplete 
> information about what tablets related to the delete range operation were 
> online.  It assumed the required tablets were offline when they were not.  
> The log messages above show that the split and updating of the prevRow by the 
> master overlap in time.
> We think the best solution is to ensure that scans of the metadata table for 
> merges and delete range are consistent with respect to end row and prev end 
> row matching.  Tablets cannot be considered individually; the portion of the 
> metadata table under consideration must form a proper sorted linked list.
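
The "proper sorted linked list" condition can be illustrated with a minimal sketch. This is not the fix committed for ACCUMULO-315 and does not use any Accumulo API; the function name find_chain_breaks and the tuple layout are hypothetical, and rows are treated as plain fixed-width hex strings. The sample data is the ~tab:~pr listing for table 4ct quoted in the issue.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (end_row, prev_end_row) for one tablet.
Tablet = Tuple[str, Optional[str]]


def find_chain_breaks(tablets: List[Tablet]) -> List[Tuple[Tablet, Tablet]]:
    """Return adjacent tablet pairs whose rows do not link up.

    Tablets are sorted by end row; the chain is intact only if each
    tablet's prev end row equals the end row of the tablet before it.
    """
    ordered = sorted(tablets, key=lambda t: t[0])
    breaks = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[1] != prev[0]:
            breaks.append((prev, cur))
    return breaks


# (end row, prev end row) pairs copied from the metadata listing in the issue.
METADATA_4CT: List[Tablet] = [
    ("262249211a62cd6f", "1819e56edae21302"),
    ("27b693c626c2d4ef", "262249211a62cd6f"),
    ("43422047c78fa52b", "41ea825af0f262d9"),
    ("4d2d3be2823b0bf4", "27b693c626c2d4ef"),
    ("4f89df61392bb311", "4d2d3be2823b0bf4"),
]

if __name__ == "__main__":
    for prev, cur in find_chain_breaks(METADATA_4CT):
        print(f"break: tablet ending {cur[0]} has prev end row {cur[1]}, "
              f"but the preceding tablet ends at {prev[0]}")
```

On the listing from the issue, the check flags the tablets ending at 43422047c78fa52b and 4d2d3be2823b0bf4, which are the two entries touched by the overlapping split and prevRow update. A scan used for merge or delete range would presumably need to retry or fail when such a break is seen, rather than act on the tablets one at a time.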

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
