[ https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189883#comment-13189883 ]
Keith Turner commented on ACCUMULO-334: --------------------------------------- Ran the test overnight and saw the bug again. Have not looked into it in detail. Billie helped me find the offendening ranges easily w/ the following shell command. John assisted w/ the not zero. {noformat} /bin/accumulo shell -u root -p secret -e 'scan -t bulk_xxx_4807_1327038843548 -np' | egrep -C1 "[^0]$" | grep -C1 " 0" r17ecb cf:009 [] 0 r17ecc cf:000 [] 1 -- r18132 cf:009 [] 1 r18133 cf:000 [] 0 {noformat} I thought this was cool enough to post. Yesterday there were two bad ranges. Today there is one. Although yesterday I only looked for -1, so there could have been more. This command looks for any non-zero values. > Bulk random walk test failed > ---------------------------- > > Key: ACCUMULO-334 > URL: https://issues.apache.org/jira/browse/ACCUMULO-334 > Project: Accumulo > Issue Type: Bug > Components: test > Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 > node cluster > Reporter: Keith Turner > Assignee: Keith Turner > Fix For: 1.4.0 > > > The bulk random walk test failed while running on a 10 node cluster w/ the > following error message. > {noformat} > 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0 > 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk > java.lang.Exception: Error running node Bulk.xml > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253) > at > org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61) > at > org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.accumulo.start.Main$1.run(Main.java:89) > at java.lang.Thread.run(Thread.java:662) > Caused by: java.lang.Exception: Error running node bulk.Verify > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253) > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249) > ... 8 more > Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 > false -1 > at > org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51) > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249) > ... 9 more > {noformat} > Looking at the table the rows [r0d646, r0edd9] and [r0f056, r10467] all had > -1 values. There was a tablet that overlapped the first range of -1 rows > exactly 268;r0edd9;r0d645. This tablet had only the following activity on a > tablet server and was then merged out of existence. The merge operation was > 268;r10eff;r093b1. > {noformat} > 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split > 268;r0edd9;r0d645 [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] > 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split > 268;r0edd9;r0d645 268;r0f055;r0edd9 > 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened > 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import > /b-0001azi/I0001azm.rf 17138 0 > 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 > [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp > 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import > /b-0001azx/I0001azy.rf 16620 0 > 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 > 181,080 read | 60,360 written | 553,761 entries/sec | 0.327 secs > 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC > [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf > 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 > [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp > 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 > 60,360 read | 60,360 written | 534,159 entries/sec | 0.113 secs > 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC > [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf > 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true > queueMinC=false disableWrites=false) 268;r0edd9;r0d645 > 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true > completeClose=true) 268;r0edd9;r0d645 > 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed > 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning > 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null) > 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645 > 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645 > {noformat} > For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the > split point r0f055. Howerver, there is no split point corresponding to > r10467. All of the tablets w/ a split of r0f055 lived on one tablet server. > {noformat} > 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split > 268;r0f055;r0d645 268<;r0f055 > 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened > 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened > 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split > 268;r11da6;r0f055 268<;r11da6 > 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened > 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split > 268;r0edd9;r0d645 268;r0f055;r0edd9 > 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened > 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split > 268;r0f622;r0f055 268;r11da6;r0f622 > 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened > {noformat} > All of the tablets mentioned so far were all merged away in the same merge > operation, making this operation a possible place were data loss occurred. > However, I can not pinpoint the issue at this point in time. Below is a > little info about the merge from the master logs showing which tablets were > involved in the merge. > {noformat} > 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of > 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED > 19 00:05:30,677 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14 > {noformat} > When this test verifies its data and detects data loss, there is no easy way > to determine at what time the data loss occurred. It might be useful to > modify the data in the bulk test such that it is easier to determine the time > when data was lost. For example the continuous ingest test creates linked > list and it is possible to determine tight time bounds when a node was > ingested. However that may change the nature of this test and the bugs that > it might find. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira