[ https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190079#comment-13190079 ]
Keith Turner commented on ACCUMULO-334: --------------------------------------- I have been look into this second occurrence. It is similar to the first. There is one range of data that is wrong [r17ecc,r18132] which perfectly matches a tablet that existed 1hxx;r18132;r17ecb. This tablet was created by a split, was short lived, and was wiped out by a merge. All of the files that existed in 1hxx;r18132;r17ecb when it was closed ended up in the tablet created by the merge. This rules out the merge dropping a file from the tablet. This seems to leave two possibilities. * A bulk import into 1hxx;r18132;r17ecb silently failed, so it did not get a file that other tablets got. * The merge did not chop a tablet. Looking for things that might indicate a failure to bulk import, I found the following. The timing of the following error was immediately before the split that created 1hxx;r18132;r17ecb. But the range seems unrelated. {noformat} 20 06:02:39,064 [client.BulkImporter] ERROR: Map file hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf failed more than three times, giving up. 20 06:02:39,064 [client.BulkImporter] ERROR: hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf -> 1hxx;r11fca;r1011f 20 06:02:39,096 [client.BulkImporter] DEBUG: Copying hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf to /tmp/bulk_aff331e7-73e2-46c3-a669-f35c73f51499_fail/I0001zul.rf {noformat} I looked at the parent tablets files, these files were passed to both siblings and compacted away by both siblings. So none of the files from the parent of 1hxx;r18132;r17ecb ended up in the merged tablet. I did not extend this analysis to tablets other than the parent. > Bulk random walk test failed > ---------------------------- > > Key: ACCUMULO-334 > URL: https://issues.apache.org/jira/browse/ACCUMULO-334 > Project: Accumulo > Issue Type: Bug > Components: test > Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 > node cluster > Reporter: Keith Turner > Assignee: Keith Turner > Fix For: 1.4.0 > > > The bulk random walk test failed while running on a 10 node cluster w/ the > following error message. > {noformat} > 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0 > 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk > java.lang.Exception: Error running node Bulk.xml > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253) > at > org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61) > at > org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.accumulo.start.Main$1.run(Main.java:89) > at java.lang.Thread.run(Thread.java:662) > Caused by: java.lang.Exception: Error running node bulk.Verify > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253) > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249) > ... 8 more > Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 > false -1 > at > org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51) > at > org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249) > ... 9 more > {noformat} > Looking at the table the rows [r0d646, r0edd9] and [r0f056, r10467] all had > -1 values. There was a tablet that overlapped the first range of -1 rows > exactly 268;r0edd9;r0d645. This tablet had only the following activity on a > tablet server and was then merged out of existence. The merge operation was > 268;r10eff;r093b1. > {noformat} > 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split > 268;r0edd9;r0d645 [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] > 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split > 268;r0edd9;r0d645 268;r0f055;r0edd9 > 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened > 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import > /b-0001azi/I0001azm.rf 17138 0 > 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 > [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp > 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import > /b-0001azx/I0001azy.rf 16620 0 > 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 > 181,080 read | 60,360 written | 553,761 entries/sec | 0.327 secs > 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC > [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, > /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf > 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 > [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp > 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 > 60,360 read | 60,360 written | 534,159 entries/sec | 0.113 secs > 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC > [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf > 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true > queueMinC=false disableWrites=false) 268;r0edd9;r0d645 > 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true > completeClose=true) 268;r0edd9;r0d645 > 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed > 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning > 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null) > 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645 > 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645 > {noformat} > For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the > split point r0f055. Howerver, there is no split point corresponding to > r10467. All of the tablets w/ a split of r0f055 lived on one tablet server. > {noformat} > 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split > 268;r0f055;r0d645 268<;r0f055 > 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened > 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened > 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split > 268;r11da6;r0f055 268<;r11da6 > 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened > 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split > 268;r0edd9;r0d645 268;r0f055;r0edd9 > 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened > 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split > 268;r0f622;r0f055 268;r11da6;r0f622 > 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened > {noformat} > All of the tablets mentioned so far were all merged away in the same merge > operation, making this operation a possible place were data loss occurred. > However, I can not pinpoint the issue at this point in time. Below is a > little info about the merge from the master logs showing which tablets were > involved in the merge. > {noformat} > 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of > 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED > 19 00:05:30,677 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622 > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b > 19 00:05:30,678 [master.Master] INFO : Asking > xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14 > {noformat} > When this test verifies its data and detects data loss, there is no easy way > to determine at what time the data loss occurred. It might be useful to > modify the data in the bulk test such that it is easier to determine the time > when data was lost. For example the continuous ingest test creates linked > list and it is possible to determine tight time bounds when a node was > ingested. However that may change the nature of this test and the bugs that > it might find. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira