[
https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Turner reassigned ACCUMULO-334:
-------------------------------------
Assignee: Keith Turner (was: John Vines)
> Bulk random walk test failed
> ----------------------------
>
> Key: ACCUMULO-334
> URL: https://issues.apache.org/jira/browse/ACCUMULO-334
> Project: Accumulo
> Issue Type: Bug
> Components: test
> Environment: Running random walk test against 1.4.0-SNAPSHOT on 10
> node cluster
> Reporter: Keith Turner
> Assignee: Keith Turner
> Fix For: 1.4.0
>
>
> The bulk random walk test failed while running on a 10 node cluster w/ the
> following error message.
> {noformat}
> 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0
> 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Bulk.xml
> at
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
> at
> org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61)
> at
> org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.accumulo.start.Main$1.run(Main.java:89)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.Exception: Error running node bulk.Verify
> at
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
> at
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
> ... 8 more
> Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943
> false -1
> at
> org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51)
> at
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
> ... 9 more
> {noformat}
> Looking at the table the rows [r0d646, r0edd9] and [r0f056, r10467] all had
> -1 values. There was a tablet that overlapped the first range of -1 rows
> exactly 268;r0edd9;r0d645. This tablet had only the following activity on a
> tablet server and was then merged out of existence. The merge operation was
> 268;r10eff;r093b1.
> {noformat}
> 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split
> 268;r0edd9;r0d645 [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf,
> /t-0001ale/A0001an3.rf]
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened
> 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import
> /b-0001azi/I0001azm.rf 17138 0
> 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf,
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp
> 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import
> /b-0001azx/I0001azy.rf 16620 0
> 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645
> 181,080 read | 60,360 written | 553,761 entries/sec | 0.327 secs
> 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf,
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf
> 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp
> 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645
> 60,360 read | 60,360 written | 534,159 entries/sec | 0.113 secs
> 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true
> queueMinC=false disableWrites=false) 268;r0edd9;r0d645
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true
> completeClose=true) 268;r0edd9;r0d645
> 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed
> 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning
> 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null)
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> {noformat}
> For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the
> split point r0f055. Howerver, there is no split point corresponding to
> r10467. All of the tablets w/ a split of r0f055 lived on one tablet server.
> {noformat}
> 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split
> 268;r0f055;r0d645 268<;r0f055
> 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened
> 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened
> 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split
> 268;r11da6;r0f055 268<;r11da6
> 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened
> 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split
> 268;r0f622;r0f055 268;r11da6;r0f622
> 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened
> {noformat}
> All of the tablets mentioned so far were all merged away in the same merge
> operation, making this operation a possible place were data loss occurred.
> However, I can not pinpoint the issue at this point in time. Below is a
> little info about the merge from the master logs showing which tablets were
> involved in the merge.
> {noformat}
> 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of
> 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED
> 19 00:05:30,677 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b
> 19 00:05:30,678 [master.Master] INFO : Asking
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14
> {noformat}
> When this test verifies its data and detects data loss, there is no easy way
> to determine at what time the data loss occurred. It might be useful to
> modify the data in the bulk test such that it is easier to determine the time
> when data was lost. For example the continuous ingest test creates linked
> list and it is possible to determine tight time bounds when a node was
> ingested. However that may change the nature of this test and the bugs that
> it might find.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira