[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839634#comment-13839634 ]
Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

I'm having a hard time recreating the jagged counts. I tried reverting patches, and tested both before and after the patch nkeywal provided. I think the flush problem was a red herring: I was biased by the customer problem I was recently working on.

When I changed my tests to do 100000 increments, the pattern I saw really jumped out. Looking at the original numbers from this morning, I see the same pattern present with the 250000 increments.

80 threads, 250000 increments == 3125 increments / thread.

count = 246875 != 250000 (flush)   // one thread failed to start.
count = 243750 != 250000 (kill)    // two threads failed to start.
count = 246878 != 250000 (kill -9) // one thread failed to start, and 3 threads sent increments that succeeded but were retried because the kill -9 kept the ack from arriving.

The last one threw me off because it wasn't regular, but I think the explanation I have makes sense.

I'm looking into whether my test code is bad (is there TableName documentation I ignored that says the race in the stacktrace is my fault?) or whether we need to add some synchronization to this createTableNameIfNecessary method.

> Increments lost after flush
> ----------------------------
>
>                 Key: HBASE-10079
>                 URL: https://issues.apache.org/jira/browse/HBASE-10079
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.1
>            Reporter: Jonathan Hsieh
>            Priority: Blocker
>             Fix For: 0.96.1
>
>         Attachments: 10079.v1.patch
>
>
> Testing 0.96.1rc1.
> With one process incrementing a row in a table, we increment a single col. We
> flush or do kills/kill -9 and data is lost. flush and kill are likely the
> same problem (kill would flush); kill -9 may or may not have the same root
> cause.
> 5 nodes
> hadoop 2.1.0 (a pre cdh5b1 hdfs).
> hbase 0.96.1 rc1
> Test: 250000 increments on a single row and single col with various numbers of
> client threads (IncrementBlaster). Verify we have a count of 250000 after
> the run (IncrementVerifier).
> Run 1: No fault injection. 5 runs. count = 250000 on multiple runs.
> Correctness verified. 1638 inc/s throughput.
> Run 2: flushes table with incrementing row. count = 246875 != 250000.
> Correctness failed. 1517 inc/s throughput.
> Run 3: kill of rs hosting incremented row. count = 243750 != 250000.
> Correctness failed. 1451 inc/s throughput.
> Run 4: one kill -9 of rs hosting incremented row. count = 246878 != 250000.
> Correctness failed. 1395 inc/s (including recovery)
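The per-thread arithmetic in the comment can be checked with a short sketch. The class and method names below are hypothetical and are not part of IncrementBlaster or IncrementVerifier; the sketch assumes every thread is given an equal share of the increments and that each lost ack results in exactly one duplicated increment.

```java
public class IncrementCountCheck {

    /**
     * Count expected on the row when some client threads never start and
     * some increments are applied twice (succeeded, ack lost, retried).
     * Assumes totalIncrements divides evenly across the threads.
     */
    static long expectedCount(int threads, long totalIncrements,
                              int threadsNotStarted, int duplicatedIncrements) {
        long perThread = totalIncrements / threads;
        return (threads - threadsNotStarted) * perThread + duplicatedIncrements;
    }

    public static void main(String[] args) {
        int threads = 80;
        long total = 250_000L;

        System.out.println("per thread: " + (total / threads));                      // 3125
        System.out.println("flush:      " + expectedCount(threads, total, 1, 0));    // 246875
        System.out.println("kill:       " + expectedCount(threads, total, 2, 0));    // 243750
        System.out.println("kill -9:    " + expectedCount(threads, total, 1, 3));    // 246878
    }
}
```

Each observed count matches a whole number of missing threads (plus three duplicates in the kill -9 run), which is consistent with the explanation that threads failed to start rather than that increments were lost on flush.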