[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839837#comment-13839837 ]

stack commented on HBASE-10079:
-------------------------------

+1. The above comment on the race made for a good read. Thanks.

> Increments lost after flush
> ---------------------------
>
>                 Key: HBASE-10079
>                 URL: https://issues.apache.org/jira/browse/HBASE-10079
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.1
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.1, 0.99.0
>
>         Attachments: 10079.v1.patch, hbase-10079.v2.patch
>
> Testing 0.96.1rc1.
> With one process incrementing a row in a table, we increment a single col. We flush or do kill/kill -9 and data is lost. flush and kill are likely the same problem (kill would flush); kill -9 may or may not have the same root cause.
> 5 nodes, hadoop 2.1.0 (a pre-cdh5b1 hdfs), hbase 0.96.1 rc1.
> Test: 250000 increments on a single row and single col with various numbers of client threads (IncrementBlaster). Verify we have a count of 250000 after the run (IncrementVerifier).
> Run 1: No fault injection. 5 runs. count = 250000 on multiple runs. Correctness verified. 1638 inc/s throughput.
> Run 2: flushes table with incrementing row. count = 246875 != 250000. Correctness failed. 1517 inc/s throughput.
> Run 3: kill of rs hosting incremented row. count = 243750 != 250000. Correctness failed. 1451 inc/s throughput.
> Run 4: one kill -9 of rs hosting incremented row. count = 246878 != 250000. Correctness failed. 1395 inc/s (including recovery).

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839836#comment-13839836 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

This was useful. Specifically, the "don't consume the buffer" section was applicable.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839827#comment-13839827 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

{code}
  /**
   * Check that the object does not exist already. There are two reasons for creating the objects
   * only once:
   * 1) With 100K regions, the table names take ~20MB.
   * 2) Equals becomes much faster as it's resolved with a reference and an int comparison.
   */
01  private static TableName createTableNameIfNecessary(ByteBuffer bns, ByteBuffer qns) {
02    for (TableName tn : tableCache) {
03      if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
04        return tn;
05      }
06    }
07
08    TableName newTable = new TableName(bns, qns);
09    if (tableCache.add(newTable)) {  // Adds the specified element if it is not already present
10      return newTable;
11    } else {
12      // Someone else added it. Let's find it.
13      for (TableName tn : tableCache) {
14        if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
15          return tn;
16        }
17      }
18    }
19
20    throw new IllegalStateException(newTable + " was supposed to be in the cache");
21  }
{code}

Here's the race. We have two concurrent calls to createTableNameIfNecessary with the same namespace (which gets wrapped and becomes bns) and table qualifier (which gets wrapped and becomes qns) -- ns=default and tn=test in my rig's case.

Thread one executes to line 08. bns and qns are consumed by the get()s in the TableName(BB, BB) constructor.
Thread two executes to line 08. bns and qns are consumed by the get()s in the TableName(BB, BB) constructor.
Thread two gets true at line 09 and exits, returning newTable at line 10.
Thread one gets false at line 09, since thread two's TableName made it in, and continues executing at line 12.
Thread one's first Bytes.equals at line 14 compares the byte[] tn.getQualifier() against qns, which is a consumed BB and thus has no more data to get.
This essentially always fails: thread one loops through, falls out, and then throws the IllegalStateException. So any time we get to line 14, we'll fail.

The solution is to make sure the constructor dups bns and qns before extracting the byte[]s. Patch coming.
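The consume-then-fail behavior can be reproduced in isolation. The sketch below is a simplified stand-in, not HBase's actual TableName: it shows that extracting bytes with relative get()s empties the caller's ByteBuffer, while duplicating first (the shape of the proposed fix) leaves the caller's buffer intact for the later re-scan.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BufferConsumption {
    // The "broken" extraction: relative gets advance the buffer's position,
    // leaving no remaining data for anyone who reads the buffer afterwards.
    static byte[] consume(ByteBuffer bb) {
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // The fix: duplicate() shares the content but has an independent
    // position/limit, so the caller's buffer is untouched.
    static byte[] copyWithoutConsuming(ByteBuffer bb) {
        ByteBuffer dup = bb.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] qualifier = "test".getBytes();

        ByteBuffer qns = ByteBuffer.wrap(qualifier);
        consume(qns);  // constructor-style extraction
        // The re-scan at "line 14" now sees an empty buffer: no match possible.
        System.out.println("after consume, remaining = " + qns.remaining());     // 0

        ByteBuffer qns2 = ByteBuffer.wrap(qualifier);
        byte[] copy = copyWithoutConsuming(qns2);
        // Buffer still holds its data; a later equality check still works.
        System.out.println("after duplicate, remaining = " + qns2.remaining());  // 4
        System.out.println("match = " + Arrays.equals(copy, qualifier));         // true
    }
}
```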
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839793#comment-13839793 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

This also jives with it not showing up on 0.96.0 but showing up on 0.96.1 -- HBASE-9976 was committed recently, between the release and the release candidate.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839791#comment-13839791 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

[~apurtell] Yes. Also added 0.99.

[~lhofhansl] I'm pretty sure that TestAtomicOperation caught the bug I thought had regressed. This is definitely a race in TableName's caching mechanism. Our friend [~tlipcon] is fairly certain it is a ByteBuffer usage problem -- gets on BBs mutate them. I'm trying to come up with a trace that shows the race (I think it has to do with Bytes.equals(byte[], BB), since that method actually mutates the BB and then restores state afterwards).
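The mutate-then-restore pattern attributed to Bytes.equals(byte[], BB) can be sketched as follows. This is a hypothetical helper, not HBase's actual implementation: a comparison may read the buffer with relative gets as long as it saves and restores the position, so the caller never observes the mutation.

```java
import java.nio.ByteBuffer;

public class RestoringEquals {
    static boolean equals(byte[] left, ByteBuffer right) {
        if (left.length != right.remaining()) return false;
        int saved = right.position();       // remember where we started
        try {
            for (byte b : left) {
                if (b != right.get()) return false;  // relative get mutates...
            }
            return true;
        } finally {
            right.position(saved);          // ...but we always restore
        }
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("test".getBytes());
        System.out.println(equals("test".getBytes(), bb));  // true
        // Position was restored, so the buffer is still fully readable
        // and a repeated comparison still succeeds.
        System.out.println(bb.remaining());                 // 4
        System.out.println(equals("test".getBytes(), bb));  // true
    }
}
```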
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839789#comment-13839789 ]

Lars Hofhansl commented on HBASE-10079:
---------------------------------------

Let's add some flushes to the Increment part of TestAtomicOperation. It should have found this issue.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839689#comment-13839689 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

[~nkeywal] HBASE-9976 introduces the TableName cache, which is the root cause.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839677#comment-13839677 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

Removed the HTablePool code and still got a race.

{code}
Exception in thread "Thread-1" java.lang.IllegalStateException: test was supposed to be in the cache
        at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
        at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:412)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
        at IncrementBlaster$1.run(IncrementBlaster.java:130)
{code}

This table cache is the root cause of the race. The testing program has n threads, each of which waits at a rendezvous point before creating an independent HTable instance with the same name. It is unreasonable for separate HTable constructors that just happen to open the same table to race like this. The fix should be in the TableName cache.
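The rendezvous structure of the repro can be sketched with a CyclicBarrier. This is an assumed shape of the test, not the actual IncrementBlaster source: all n threads block on the barrier, then release at once so their table-open calls hit the TableName cache at effectively the same instant, maximizing the chance of hitting the lookup-then-add race.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class Rendezvous {
    // Returns how many threads made it through the barrier together.
    static int run(int n) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(n);
        AtomicInteger passed = new AtomicInteger();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                try {
                    barrier.await();  // all n threads release at once
                    // ... each thread would construct its own HTable here,
                    // driving concurrent createTableNameIfNecessary calls.
                    passed.incrementAndGet();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return passed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(8) + " threads rendezvoused");
    }
}
```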
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839643#comment-13839643 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

Hm.. HBASE-6580 deprecates HTablePool and happened when I wasn't paying attention.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839634#comment-13839634 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

I'm having a hard time recreating the jagged counts. I tried reverting patches, both before and after the patch nkeywal provided. I think the flush problem was a red herring: I was biased by the customer problem I was recently working on.

When I changed my tests to do 10 increments, the pattern I saw really jumped out. Looking at the original numbers from this morning, I see the same pattern present with the 250000 increments. 80 threads, 250000 increments == 3125 increments/thread.

count = 246875 != 250000 (flush)    // one thread failed to start.
count = 243750 != 250000 (kill)     // two threads failed to start.
count = 246878 != 250000 (kill -9)  // one thread failed to start, and we had 3 threads that sent increments that succeeded and retried but didn't get an ack because of the kill -9.

The last one threw us off because it wasn't regular, but I think the explanation makes sense. I'm looking into whether my test code is bad (is there TableName documentation I ignored that says the race in the stack trace is my fault?) or whether we need to add some synchronization to this createTableNameIfNecessary method.
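The thread-accounting arithmetic above can be double-checked with a few lines. The model here is an assumption drawn from the comment: each thread that dies before starting subtracts exactly one thread's share (3125) from the final count, and the kill -9 run additionally double-counts 3 retried increments.

```java
public class CountCheck {
    static final int THREADS = 80;
    static final int TOTAL = 250000;
    static final int PER_THREAD = TOTAL / THREADS;  // 3125 increments per thread

    // Final count when `lost` threads never sent their increments.
    static int countWithLostThreads(int lost) {
        return TOTAL - lost * PER_THREAD;
    }

    public static void main(String[] args) {
        System.out.println(countWithLostThreads(1));      // flush run:  246875
        System.out.println(countWithLostThreads(2));      // kill run:   243750
        System.out.println(countWithLostThreads(1) + 3);  // kill -9:    246878
    }
}
```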
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839554#comment-13839554 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

Here's the dropped "threads" stack dump -- each one of these corresponds to a thread that didn't run.

{code}
Exception in thread "Thread-58" java.lang.IllegalStateException: test was supposed to be in the cache
        at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
        at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:385)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:165)
        at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:39)
        at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:271)
        at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:201)
        at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:180)
        at IncrementBlaster$1.run(IncrementBlaster.java:131)
{code}
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839495#comment-13839495 ]

Sergey Shelukhin commented on HBASE-10079:
------------------------------------------

+1
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839491#comment-13839491 ]

stack commented on HBASE-10079:
-------------------------------

Patch is good. Nice work, Jon. Makes sense that this missing lock was exposed by hbase-9963. Pity we didn't catch it in tests previously. Any chance of a test?
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839431#comment-13839431 ]

Hadoop QA commented on HBASE-10079:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12617058/10079.v1.patch
  against trunk revision .

  {color:green}+1 @author{color}. The patch does not contain any @author tags.

  {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

  {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile.

  {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile.

  {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages.

  {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

  {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

  {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

  {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.

  {color:red}-1 site{color}. The patch appears to cause mvn site goal to fail.

  {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//console

This message is automatically generated.
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839351#comment-13839351 ]

Nicolas Liochon commented on HBASE-10079:
-----------------------------------------

That's strange. We should lock, and we don't do it in trunk or 0.96 head...
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839334#comment-13839334 ] Jonathan Hsieh commented on HBASE-10079: Actually, the current tip of 0.96 (HBASE-9485) doesn't seem to have the flush problem, but it does have the htable initialization problem.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839333#comment-13839333 ] Nicolas Liochon commented on HBASE-10079: I guess the error is in HBASE-9963. It seems there is an issue in HStore#StoreFlusherImpl#prepare: there is no lock there.
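To make the suspected race concrete, here is a deterministic, single-threaded replay of one bad interleaving: a flush "prepare" swaps the active memstore buffer without excluding an in-flight increment, so the increment's cell lands in a buffer that is never flushed. This is a toy model, not real HBase code; all names (`active`, `flushed`, `racyRun`) are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the HBASE-10079 race: prepare() swaps the active buffer
// without the region update lock, so an in-flight increment can write
// into a buffer that has already been snapshotted and discarded.
class LostIncrementDemo {
    static List<Long> active = new ArrayList<>();
    static List<Long> flushed = new ArrayList<>();

    // Latest value wins: last cell in active, else last flushed cell.
    static long read() {
        if (!active.isEmpty()) return active.get(active.size() - 1);
        if (!flushed.isEmpty()) return flushed.get(flushed.size() - 1);
        return 0L;
    }

    // One interleaving, replayed deterministically on a single thread.
    static long racyRun() {
        active.add(1L);                 // increment #1 completes normally
        long cur = read();              // increment #2 reads 1 ...
        List<Long> stale = active;      // ... and still holds the old buffer
        // flush prepare() runs here WITHOUT waiting for increment #2:
        flushed.addAll(active);         // snapshot is taken and persisted
        active = new ArrayList<>();     // fresh empty buffer installed
        stale.clear();                  // old snapshot buffer is released
        stale.add(cur + 1);             // increment #2 writes into the dead buffer
        return read();                  // still 1: the increment is lost
    }
}
```

Taking the region's write lock across the buffer swap (so `prepare()` waits for in-flight writers, or writers retry against the new buffer) closes this window, which is consistent with the "we should lock" observation above.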
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839291#comment-13839291 ] Jonathan Hsieh commented on HBASE-10079: Seems like reverting either HBASE-9963 or HBASE-10014 gets rid of the "jagged" losses due to flushes. However, when testing on the tip of 0.96 with the reverts, I seem to be losing some threads as they initialize because of some sort of race. I'm going to try from the exact point where 0.96.1rc1 was cut to see if it is in a happy place, and will investigate the htable initialization problem afterwards.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839233#comment-13839233 ] Jonathan Hsieh commented on HBASE-10079: I tweaked the test and wasn't able to duplicate this at the unit test level. I'm looking into reverting a few patches touching the memstore/flush area and testing on the cluster (HBASE-9963 and HBASE-10014 seem like candidates) to see if they caused the problem.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839153#comment-13839153 ] Jonathan Hsieh commented on HBASE-10079: TestHRegion#testParallelIncrementWithMemStoreFlush passes on the 0.96 tip. The test actually waits for all the increments to be done before flushing (instead of flushing while other increments are happening), so my bet is that it doesn't actually test the race condition.
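A sketch of the kind of test the comment says is missing: one that flushes *while* increments are in flight rather than after they finish. The tiny store below is a stand-in for a region, not the HBase API; the `ReadWriteLock` models the update lock that flush `prepare()` should take, and the `AtomicLong` stands in for the memstore's own thread-safe add. With that locking discipline, flushing mid-stream never drops a count.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy store: increments take the read lock (many may run concurrently),
// while flush prepare() takes the write lock to quiesce in-flight writers.
class ConcurrentFlushTest {
    static final ReadWriteLock updatesLock = new ReentrantReadWriteLock();
    static final AtomicLong active = new AtomicLong();  // "memstore"
    static long flushed = 0;                            // "store files", guarded by writeLock

    static void increment() {
        updatesLock.readLock().lock();
        try { active.incrementAndGet(); }
        finally { updatesLock.readLock().unlock(); }
    }

    static void flush() {
        updatesLock.writeLock().lock();
        try { flushed += active.getAndSet(0); }         // move snapshot atomically
        finally { updatesLock.writeLock().unlock(); }
    }

    static long run(int threads, int perThread) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment();
                    if (i % 1000 == 0) flush();         // flush mid-stream, on purpose
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        flush();                                        // drain whatever is left
        return flushed;
    }
}
```

The assertion to make is simply `run(threads, perThread) == threads * perThread`; a test that only flushes after all increments complete can never fail that way, which matches the "doesn't actually test the race" observation.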
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839114#comment-13839114 ] Jonathan Hsieh commented on HBASE-10079: This was the issue that fixed the problem in the 0.94 and 0.95 branches (at the time). It added a test to TestHRegion called testParallelIncrementWithMemStoreFlush that tests the situation.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839103#comment-13839103 ] Jonathan Hsieh commented on HBASE-10079: Do you mean the increval rig with the github link in the first comment? No, that was a quick and dirty program to duplicate a customer issue. I'm in the process of adding flushes to the TestAtomicOperation unit tests that [~lhofhansl] mentioned on the mailing list. I'll be able to bisect to find the bug that way.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839102#comment-13839102 ] Sergey Shelukhin commented on HBASE-10079: hbase.regionserver.nonces.enabled is the server config setting. Although, during replay, the updates are never blocked if nonces collide.
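For readers following the nonce thread: the idea is that each non-idempotent operation carries a nonce, and the server skips any operation whose nonce it has already executed, so a client retry doesn't double-apply an increment (which is also the suspected cause of the overcount-by-1 seen later in this thread). Here is a rough toy model of that dedup behavior; it is illustrative only and not the actual ServerNonceManager implementation, and disabling it would correspond to setting hbase.regionserver.nonces.enabled to false as mentioned above.

```java
import java.util.HashSet;
import java.util.Set;

// Toy nonce dedup: an increment retried with the same nonce is detected
// and not applied a second time. Not the real HBase nonce manager.
class NonceDemo {
    static final Set<Long> seen = new HashSet<>();
    static long counter = 0;

    // Returns true if the increment was applied, false if it was a duplicate.
    static boolean increment(long nonce) {
        if (!seen.add(nonce)) return false;  // collision: already executed
        counter++;
        return true;
    }
}
```

The caveat in the comment is the interesting part: if nonce collisions are not checked (or not enforced during WAL replay), a replayed or retried increment applies twice, which shows up as an overcount rather than a loss.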
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839099#comment-13839099 ] Sergey Shelukhin commented on HBASE-10079: Does the writer check for exceptions? Can you try disabling nonces, to see if there could be collisions (although I would expect the client to receive exceptions in such cases)?
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839085#comment-13839085 ] Jonathan Hsieh commented on HBASE-10079: In 0.96.0:
* flush: not able to reproduce data loss.
* with kill: not able to reproduce data loss; had an overcount of 1.
* with kill -9: not able to reproduce data loss; had an overcount of 1.
The overcount of 1 is likely a different bug that I'll let slide. Likely the client thought the increment failed and retried, but it actually made it to the log, and increments are not idempotent. So the bug is somewhere between 0.96.0 and 0.96.1rc1.
[jira] [Commented] (HBASE-10079) Increments lost after flush
[ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839042#comment-13839042 ] Jonathan Hsieh commented on HBASE-10079: Here's a link to the test programs I used to pull out this bug. It needs to be polished and turned into an IT test as well as a perf test, probably in a separate issue. https://github.com/jmhsieh/hbase/tree/increval