[jira] [Commented] (ACCUMULO-334) Bulk random walk test failed

Keith Turner (Commented) (JIRA) Fri, 20 Jan 2012 12:41:10 -0800

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190079#comment-13190079
 ]


Keith Turner commented on ACCUMULO-334:
---------------------------------------

I have been look into this second occurrence.  It is similar to the first.  
There is one range of data that is wrong [r17ecc,r18132] which perfectly 
matches a tablet that existed 1hxx;r18132;r17ecb.  This tablet was created by a 
split, was short lived, and was wiped out by a merge.  All of the files that 
existed in 1hxx;r18132;r17ecb when it was closed ended up in the tablet created 
by the merge.  This rules out the merge dropping a file from the tablet.  This 
seems to leave two possibilities.

 * A bulk import into 1hxx;r18132;r17ecb silently failed, so it did not get a 
file that other tablets got.
 * The merge did not chop a tablet.

Looking for things that might indicate a failure to bulk import, I found the 
following.  The timing of the following error was immediately before the split 
that created 1hxx;r18132;r17ecb. But the range seems unrelated.

{noformat}
20 06:02:39,064 [client.BulkImporter] ERROR: Map file 
hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf failed more than 
three times, giving up.
20 06:02:39,064 [client.BulkImporter] ERROR:    
hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf -> 1hxx;r11fca;r1011f
20 06:02:39,096 [client.BulkImporter] DEBUG: Copying 
hdfs://xxx:6093/accumulo/tables/1hxx/b-0001zuj/I0001zul.rf to 
/tmp/bulk_aff331e7-73e2-46c3-a669-f35c73f51499_fail/I0001zul.rf
{noformat}

I looked at the parent tablets files, these files were passed to both siblings 
and compacted away by both siblings.  So none of the files from the parent of 
1hxx;r18132;r17ecb ended up in the merged tablet.  I did not extend this 
analysis to tablets other than the parent.


  
                
> Bulk random walk test failed
> ----------------------------
>
>                 Key: ACCUMULO-334
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-334
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>         Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 
> node cluster
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.4.0
>
>
> The bulk random walk test failed while running on a 10 node cluster w/ the 
> following error message.
> {noformat}
> 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0
> 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Bulk.xml
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at 
> org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61)
>         at 
> org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.accumulo.start.Main$1.run(Main.java:89)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.Exception: Error running node bulk.Verify
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 8 more
> Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 
> false -1
>         at 
> org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51)
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 9 more
> {noformat}
> Looking at the table the rows [r0d646, r0edd9] and [r0f056, r10467] all had 
> -1 values.  There was a tablet that overlapped the first range of -1 rows 
> exactly 268;r0edd9;r0d645.  This tablet had only the following activity on a 
> tablet server and was then merged out of existence.  The merge operation was 
> 268;r10eff;r093b1.
> {noformat}
> 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split 
> 268;r0edd9;r0d645  [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf]
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened 
> 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import 
> /b-0001azi/I0001azm.rf 17138 0
> 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp
> 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import 
> /b-0001azx/I0001azy.rf 16620 0
> 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 
> 181,080 read | 60,360 written | 553,761 entries/sec |  0.327 secs
> 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC 
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf
> 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp
> 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 
> 60,360 read | 60,360 written | 534,159 entries/sec |  0.113 secs
> 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC 
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 268;r0edd9;r0d645
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) 268;r0edd9;r0d645
> 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed
> 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning 
> 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null)
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> {noformat}
> For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the 
> split point r0f055.  Howerver, there is no split point corresponding to 
> r10467. All of the tablets w/ a split of r0f055 lived on one tablet server.  
> {noformat}
> 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split 
> 268;r0f055;r0d645 268<;r0f055
> 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened 
> 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened 
> 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split 
> 268;r11da6;r0f055 268<;r11da6
> 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened 
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened 
> 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split 
> 268;r0f622;r0f055 268;r11da6;r0f622
> 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened 
> {noformat}
> All of the tablets mentioned so far were all merged away in the same merge 
> operation, making this operation a possible place were data loss occurred.  
> However, I can not pinpoint the issue at this point in time.  Below is a 
> little info about the merge from the master logs showing which tablets were 
> involved in the merge.
> {noformat}
> 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of 
> 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED
> 19 00:05:30,677 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14
> {noformat}
> When this test verifies its data and detects data loss, there is no easy way 
> to determine at what time the data loss occurred.  It might be useful to 
> modify the data in the bulk test such that it is easier to determine the time 
> when data was lost.  For example the continuous ingest test creates linked 
> list and it is possible to determine tight time bounds when a node was 
> ingested.  However that may change the nature of this test and the bugs that 
> it might find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ACCUMULO-334) Bulk random walk test failed

Reply via email to