[ https://issues.apache.org/jira/browse/ACCUMULO-315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner resolved ACCUMULO-315.
-----------------------------------

    Resolution: Fixed

> Hole in metadata table occurred during random walk test
> -------------------------------------------------------
>
>                 Key: ACCUMULO-315
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-315
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>         Environment: Running 1.4.0 SNAPSHOT on 10 node cluster.
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> While running the random walk test a hole in the metadata table occurred. A
> client tried to delete the table with the hole and the fate op got stuck.
> Was continually seeing the following in the master logs.
> {noformat}
> 14 00:02:11,273 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: 4ct locationState: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef@(null,xxx.xxx.xxx.xxx:9997[134d7425fc503e1],null)
> {noformat}
> The metadata table contained the following. Tablet 4ct;4d2d3be2823b0bf4 had
> a location.
> {noformat}
> 4ct;262249211a62cd6f ~tab:~pr [] \x011819e56edae21302
> 4ct;27b693c626c2d4ef ~tab:~pr [] \x01262249211a62cd6f
> 4ct;43422047c78fa52b ~tab:~pr [] \x0141ea825af0f262d9
> 4ct;4d2d3be2823b0bf4 ~tab:~pr [] \x0127b693c626c2d4ef
> 4ct;4f89df61392bb311 ~tab:~pr [] \x014d2d3be2823b0bf4
> {noformat}
> Found the following events on a tablet server.
> {noformat}
> #the tablet server events below are caused by the delete range operation
> 13 21:36:04,287 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;262249211a62cd6f split 4ct;27b693c626c2d4ef;262249211a62cd6f 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef
> 13 21:36:04,369 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;27b693c626c2d4ef split 4ct;41ea825af0f262d9;27b693c626c2d4ef 4ct;4d2d3be2823b0bf4;41ea825af0f262d9
> 13 21:36:04,370 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 opened
> 13 21:36:06,141 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 closed
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for low split 4ct;43422047c78fa52b;41ea825af0f262d9 [/t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> 13 21:36:06,142 [tabletserver.Tablet] DEBUG: Files for high split 4ct;4d2d3be2823b0bf4;43422047c78fa52b [/t-0001cdi/A0001cef.rf, /t-0001cdi/F0001bmw.rf, /t-0001cdi/F0001bn1.rf]
> #split from other random walker
> 13 21:36:06,351 [tabletserver.Tablet] TABLET_HIST: 4ct;4d2d3be2823b0bf4;41ea825af0f262d9 split 4ct;43422047c78fa52b;41ea825af0f262d9 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> {noformat}
> The following events occurred on the master and overlap in time with the
> split on the tablet server.
> {noformat}
> 13 21:36:06,312 [master.EventCoordinator] INFO : Merge state of 4ct;41ea825af0f262d9;27b693c626c2d4ef set to MERGING
> 13 21:36:06,312 [master.Master] DEBUG: Deleting tablets for 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,316 [master.Master] DEBUG: Found following tablet 4ct;4d2d3be2823b0bf4;43422047c78fa52b
> 13 21:36:06,317 [master.Master] DEBUG: Making file deletion entries for 4ct;41ea825af0f262d9;27b693c626c2d4ef
> 13 21:36:06,325 [master.Master] DEBUG: Removing metadata table entries in range [4ct;27b693c626c2d4ef%00; : [] 9223372036854775807 false,4ct;41ea825af0f262d9%00; : [] 9223372036854775807 false)
> 13 21:36:06,331 [master.Master] DEBUG: Updating prevRow of 4ct;4d2d3be2823b0bf4;43422047c78fa52b to 27b693c626c2d4ef
> {noformat}
> After many hours of debugging, Eric and I figured out what was going on. Two
> random walkers were running the concurrent test. One client initiated a
> delete range on table id 4ct for the range 27b693c626c2d4ef to
> 41ea825af0f262d9. While this delete range operation was in progress, another
> client added the split point 43422047c78fa52b. The master read the metadata
> table while the split was occurring and got inconsistent/incomplete
> information about which tablets related to the delete range operation were
> online. It assumed the required tablets were offline when they were not.
> The log messages above show that the split and the master's update of the
> prevRow overlap in time.
> We think the best solution is to ensure that scans of the metadata table for
> merges and delete range are consistent with respect to end row and prev end
> row matching. Tablets cannot be considered individually; the portion of the
> metadata table under consideration must form a proper sorted linked list.
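
The "proper sorted linked list" condition from the report can be illustrated with a small check. This is only a sketch in Python, not Accumulo code: the function name `find_breaks` and the tuple layout are made up for illustration, and the data is copied from the metadata listing quoted in the report. Each tablet's prev end row should equal the end row of the tablet sorted immediately before it. The first tablet in the listing has no predecessor shown, so it is not checked.

```python
from typing import List, Tuple

# (end_row, prev_end_row) pairs copied from the ~tab:~pr entries for table 4ct
# in the report. The layout of these tuples is an assumption for illustration.
TABLETS: List[Tuple[str, str]] = [
    ("262249211a62cd6f", "1819e56edae21302"),
    ("27b693c626c2d4ef", "262249211a62cd6f"),
    ("43422047c78fa52b", "41ea825af0f262d9"),
    ("4d2d3be2823b0bf4", "27b693c626c2d4ef"),
    ("4f89df61392bb311", "4d2d3be2823b0bf4"),
]


def find_breaks(tablets: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: report every place where the chain is broken.

    Returns (preceding_end_row, end_row, prev_end_row) for each tablet whose
    prev_end_row differs from the end row of the tablet sorted just before it.
    """
    ordered = sorted(tablets)  # rows here are equal-length hex strings
    breaks = []
    for (before_end, _), (end_row, prev_end) in zip(ordered, ordered[1:]):
        if prev_end != before_end:
            breaks.append((before_end, end_row, prev_end))
    return breaks


if __name__ == "__main__":
    for before_end, end_row, prev_end in find_breaks(TABLETS):
        print(f"tablet {end_row}: prev end row {prev_end}, "
              f"but preceding tablet ends at {before_end}")
```

On the listing above this flags the tablets ending at 43422047c78fa52b and 4d2d3be2823b0bf4, which are the two entries on either side of the hole. One possible use of such a check, under the same assumptions, is to treat a scan result with any breaks as inconsistent and rescan before acting on it.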