[ 
https://issues.apache.org/jira/browse/ACCUMULO-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189883#comment-13189883
 ] 

Keith Turner commented on ACCUMULO-334:
---------------------------------------

Ran the test overnight and saw the bug again.  Have not looked into it in 
detail yet.  Billie helped me find the offending ranges easily w/ the following 
shell command.  John assisted w/ the not-zero regex.

{noformat}
/bin/accumulo shell -u root -p secret -e 'scan -t bulk_xxx_4807_1327038843548 -np' | egrep -C1 "[^0]$" | grep -C1 " 0"
r17ecb cf:009 []    0
r17ecc cf:000 []    1
--
r18132 cf:009 []    1
r18133 cf:000 []    0
{noformat}

I thought this was cool enough to post.  Yesterday there were two bad ranges; 
today there is one.  Although yesterday I only looked for -1 values, so there 
could have been more.  This command looks for any non-zero values.
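
For anyone who would rather do this check from the client API than the shell, 
something like the following should give the same information (it prints every 
non-zero entry rather than just the range boundaries the grep shows).  This is 
only a sketch against the 1.4 client API; the instance name and zookeeper host 
are placeholders.

{noformat}
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class FindNonZero {
  public static void main(String[] args) throws Exception {
    // placeholder instance name and zookeepers; table name taken from the run above
    Connector conn = new ZooKeeperInstance("test", "localhost")
        .getConnector("root", "secret".getBytes());

    Scanner scanner = conn.createScanner("bulk_xxx_4807_1327038843548", new Authorizations());

    // after a complete run every value is expected to be 0, so anything else marks a bad row
    for (Entry<Key,Value> entry : scanner) {
      if (!entry.getValue().toString().equals("0"))
        System.out.println(entry.getKey().getRow() + " " + entry.getValue());
    }
  }
}
{noformat}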

                
> Bulk random walk test failed
> ----------------------------
>
>                 Key: ACCUMULO-334
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-334
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>         Environment: Running random walk test against 1.4.0-SNAPSHOT on 10 
> node cluster
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.4.0
>
>
> The bulk random walk test failed while running on a 10 node cluster w/ the 
> following error message.
> {noformat}
> 18 23:36:05,167 [bulk.Setup] INFO : Starting bulk test on 459a04a0
> 19 00:24:33,950 [randomwalk.Framework] ERROR: Error during random walk
> java.lang.Exception: Error running node Bulk.xml
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at 
> org.apache.accumulo.server.test.randomwalk.Framework.run(Framework.java:61)
>         at 
> org.apache.accumulo.server.test.randomwalk.Framework.main(Framework.java:114)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.accumulo.start.Main$1.run(Main.java:89)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.Exception: Error running node bulk.Verify
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:253)
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 8 more
> Caused by: java.lang.Exception: Bad key at r0d646 cf:000 [] 1326932285943 
> false -1
>         at 
> org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:51)
>         at 
> org.apache.accumulo.server.test.randomwalk.Module.visit(Module.java:249)
>         ... 9 more
> {noformat}
> Looking at the table, the rows in [r0d646, r0edd9] and [r0f056, r10467] all 
> had -1 values.  There was a tablet, 268;r0edd9;r0d645, that exactly overlapped 
> the first range of -1 rows.  This tablet had only the following activity on a 
> tablet server and was then merged out of existence.  The merge operation was 
> 268;r10eff;r093b1.
> {noformat}
> 19 00:05:10,966 [tabletserver.Tablet] DEBUG: Files for low split 
> 268;r0edd9;r0d645  [/b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf]
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 opened 
> 19 00:05:15,029 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import 
> /b-0001azi/I0001azm.rf 17138 0
> 19 00:05:15,103 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf_tmp
> 19 00:05:15,339 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 import 
> /b-0001azx/I0001azy.rf 16620 0
> 19 00:05:15,651 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 
> 181,080 read | 60,360 written | 553,761 entries/sec |  0.327 secs
> 19 00:05:15,661 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC 
> [/b-0001azi/I0001azm.rf, /b-0001azp/I0001azt.rf, /b-0001azp/I0001azu.rf, 
> /t-0001ale/A0001an3.rf] --> /t-0001apj/A0001bri.rf
> 19 00:05:30,672 [tabletserver.Tablet] DEBUG: Starting MajC 268;r0edd9;r0d645 
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf_tmp
> 19 00:05:30,810 [tabletserver.Compactor] DEBUG: Compaction 268;r0edd9;r0d645 
> 60,360 read | 60,360 written | 534,159 entries/sec |  0.113 secs
> 19 00:05:30,824 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 MajC 
> [/b-0001azx/I0001azy.rf] --> /t-0001apj/C0001brn.rf
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: initiateClose(saveState=true 
> queueMinC=false disableWrites=false) 268;r0edd9;r0d645
> 19 00:05:30,943 [tabletserver.Tablet] DEBUG: completeClose(saveState=true 
> completeClose=true) 268;r0edd9;r0d645
> 19 00:05:30,947 [tabletserver.Tablet] TABLET_HIST: 268;r0edd9;r0d645 closed
> 19 00:05:30,947 [tabletserver.TabletServer] DEBUG: Unassigning 
> 268;r0edd9;r0d645@(null,xxx.xxx.xxx.xxx:9997[134d7425fc59413],null)
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> 19 00:05:30,949 [tabletserver.TabletServer] INFO : unloaded 268;r0edd9;r0d645
> {noformat}
> For the second range of -1 values [r0f056, r10467], r0f056 corresponds to the 
> split point r0f055.  However, there is no split point corresponding to r10467.  
> All of the tablets w/ a split of r0f055 lived on one tablet server.
> {noformat}
> 19 00:02:21,262 [tabletserver.Tablet] TABLET_HIST: 268<;r0d645 split 
> 268;r0f055;r0d645 268<;r0f055
> 19 00:02:21,263 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 opened 
> 19 00:02:21,264 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 opened 
> 19 00:02:44,504 [tabletserver.Tablet] TABLET_HIST: 268<;r0f055 split 
> 268;r11da6;r0f055 268<;r11da6
> 19 00:02:44,505 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 opened 
> 19 00:05:10,974 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0d645 split 
> 268;r0edd9;r0d645 268;r0f055;r0edd9
> 19 00:05:10,975 [tabletserver.Tablet] TABLET_HIST: 268;r0f055;r0edd9 opened 
> 19 00:05:15,023 [tabletserver.Tablet] TABLET_HIST: 268;r11da6;r0f055 split 
> 268;r0f622;r0f055 268;r11da6;r0f622
> 19 00:05:15,024 [tabletserver.Tablet] TABLET_HIST: 268;r0f622;r0f055 opened 
> {noformat}
> All of the tablets mentioned so far were merged away in the same merge 
> operation, making that operation a possible place where the data loss occurred.  
> However, I cannot pinpoint the issue at this point in time.  Below is a little 
> info from the master logs showing which tablets were involved in the merge.
> {noformat}
> 19 00:05:30,616 [master.EventCoordinator] INFO : Merge state of 
> 268;r10eff;r093b1 set to WAITING_FOR_CHOPPED
> 19 00:05:30,677 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r09927;r0903a
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940c] to chop 268;r0ca9e;r09927
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc5940a] to chop 268;r0d2b5;r0ca9e
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59412] to chop 268;r0d645;r0d2b5
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0edd9;r0d645
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f055;r0edd9
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f622;r0f055
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r0f68b;r0f622
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r10c14;r0f68b
> 19 00:05:30,678 [master.Master] INFO : Asking 
> xxx.xxx.xxx.xxx:9997[134d7425fc59413] to chop 268;r110f7;r10c14
> {noformat}
> When this test verifies its data and detects data loss, there is no easy way 
> to determine at what time the data loss occurred.  It might be useful to modify 
> the data in the bulk test such that it is easier to determine the time when 
> data was lost.  For example, the continuous ingest test creates a linked list, 
> and it is possible to determine tight time bounds on when a node was ingested.  
> However, that may change the nature of this test and the bugs it might find.
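
To make that last point a bit more concrete, one option would be to carry the 
wall-clock time of each bulk import in the value itself, so that when 
verification finds a bad row it can at least report when that row was last 
written.  The sketch below is purely illustrative, glosses over how the test 
currently sums its values, and the class and method names are made up.

{noformat}
import java.util.Date;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

// Hypothetical helper, not part of the current test.
public class TimedValue {

  // encode the counter delta together with the ingest time of the batch
  static Value encode(long delta, long ingestTimeMillis) {
    return new Value((delta + "," + ingestTimeMillis).getBytes());
  }

  // verification: the summed delta should be 0; if it is not, report the
  // last write time so the loss can be bounded in time
  static void check(Key key, Value value) throws Exception {
    String[] parts = value.toString().split(",");
    long sum = Long.parseLong(parts[0]);
    long lastWrite = Long.parseLong(parts[1]);

    if (sum != 0)
      throw new Exception("Bad key at " + key + " sum=" + sum
          + " last written " + new Date(lastWrite));
  }
}
{noformat}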
