[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839837#comment-13839837
 ] 

stack commented on HBASE-10079:
---

+1

Above comment on the race made for a good read.  Thanks.

> Increments lost after flush 
> 
>
> Key: HBASE-10079
> URL: https://issues.apache.org/jira/browse/HBASE-10079
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.96.1
>Reporter: Jonathan Hsieh
>Assignee: Jonathan Hsieh
>Priority: Blocker
> Fix For: 0.98.0, 0.96.1, 0.99.0
>
> Attachments: 10079.v1.patch, hbase-10079.v2.patch
>
>
> Testing 0.96.1rc1.
> With one process incrementing a row in a table, we increment a single col.  We 
> flush or do kill/kill -9 and data is lost.  flush and kill are likely the 
> same problem (kill would flush); kill -9 may or may not have the same root 
> cause.
> 5 nodes
> hadoop 2.1.0 (a pre-cdh5b1 hdfs).
> hbase 0.96.1 rc1 
> Test: 250k increments on a single row and single col with various numbers of 
> client threads (IncrementBlaster).  Verify we have a count of 250000 after 
> the run (IncrementVerifier).
> Run 1: No fault injection.  5 runs.  count = 250000 on multiple runs.  
> Correctness verified.  1638 inc/s throughput.
> Run 2: flushes table with incrementing row.  count = 246875 != 250000.  
> Correctness failed.  1517 inc/s throughput.
> Run 3: kill of rs hosting incremented row.  count = 243750 != 250000. 
> Correctness failed.  1451 inc/s throughput.
> Run 4: one kill -9 of rs hosting incremented row.  count = 246878 != 250000.  
> Correctness failed.  1395 inc/s (including recovery).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839836#comment-13839836
 ] 

Jonathan Hsieh commented on HBASE-10079:


This was useful. Specifically the "don't consume the buffer" section was 
applicable.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839827#comment-13839827
 ] 

Jonathan Hsieh commented on HBASE-10079:


{code}
  /**
   * Check that the object does not exist already. There are two reasons for creating the objects
   * only once:
   * 1) With 100K regions, the table names take ~20MB.
   * 2) Equals becomes much faster as it's resolved with a reference and an int comparison.
   */
01  private static TableName createTableNameIfNecessary(ByteBuffer bns, ByteBuffer qns) {
02    for (TableName tn : tableCache) {
03      if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
04        return tn;
05      }
06    }
07
08    TableName newTable = new TableName(bns, qns);
09    if (tableCache.add(newTable)) {  // Adds the specified element if it is not already present
10      return newTable;
11    } else {
12      // Someone else added it. Let's find it.
13      for (TableName tn : tableCache) {
14        if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
15          return tn;
16        }
17      }
18    }
19
20    throw new IllegalStateException(newTable + " was supposed to be in the cache");
21  }
{code}

Here's the race:

We have two concurrent calls to createTableNameIfNecessary with the same 
namespace (which gets wrapped and becomes bns) and table qualifier (which gets 
wrapped and becomes qns) -- ns=default and tn=test in my rig's case.

Thread one executes to line 08.  bns and qns are consumed by the gets in the 
TableName(BB,BB) constructor.
Thread two executes to line 08.  bns and qns are consumed by the gets in the 
TableName(BB,BB) constructor.
Thread two's add returns true at line 09, and it returns newTable at line 10.
Thread one's add returns false, since thread two's TableName made it in.  It 
jumps to line 12 and continues executing.
At line 14, thread one's first Bytes.equals call compares the byte[] 
tn.getQualifier() against qns (which is a consumed BB, and thus has no more 
data to get).  This essentially always fails.
Thread one loops through, falls out, and throws the IllegalStateException.

So anytime we get to line 14, we'll fail.

Solution is to make sure the constructor dups bns and qns before extracting the 
byte[]s.  Patch coming.
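The consume-then-fail behavior is easy to demonstrate in isolation. A minimal sketch (extractConsuming and extractSafely are hypothetical stand-ins, not the actual TableName constructor code): a relative bulk get() drains the source buffer, while extracting through a duplicate() leaves the caller's position untouched, which is the shape of the fix described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DupDemo {
    // Extraction the way the buggy constructor does it: the relative
    // bulk get() advances the source buffer's position to its limit.
    static byte[] extractConsuming(ByteBuffer bb) {
        byte[] out = new byte[bb.remaining()];
        bb.get(out);              // consumes bb
        return out;
    }

    // The fix: read through a duplicate, which shares content but has an
    // independent position/limit, so the caller's buffer is untouched.
    static byte[] extractSafely(ByteBuffer bb) {
        ByteBuffer dup = bb.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer qns = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));
        extractConsuming(qns);
        System.out.println("after consuming get: remaining=" + qns.remaining()); // 0

        ByteBuffer qns2 = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));
        extractSafely(qns2);
        System.out.println("after dup'd get: remaining=" + qns2.remaining());    // 4
    }
}
```

After extractConsuming, any later byte-wise comparison against qns sees zero remaining bytes, which is exactly why the line-14 equality check can never succeed.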





[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839793#comment-13839793
 ] 

Jonathan Hsieh commented on HBASE-10079:


This also jibes with it not showing up in 0.96.0 but showing up in 0.96.1 -- 
HBASE-9976 was committed recently, between the release and the release candidate.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839791#comment-13839791
 ] 

Jonathan Hsieh commented on HBASE-10079:


[~apurtell] Yes. Also added 0.99.

[~lhofhansl] I'm pretty sure that TestAtomicOperation caught the bug I thought 
had regressed.  This is definitely a race in TableName's caching mechanism.  
Our friend [~tlipcon] is fairly certain it is a ByteBuffer usage problem -- 
gets on BBs mutate them.  I'm trying to come up with a trace that shows the 
race (I think it has to do with Bytes.equals(byte[], BB), since that method 
actually mutates the BB and then restores its state afterwards).
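The mutate-then-restore pattern being described can be sketched like this (a hedged illustration of the technique, not the real HBase Bytes.equals source): the comparison walks the buffer with relative gets, which move its position, and only puts the position back at the end. That window is what makes the method unsafe against a shared, unsynchronized ByteBuffer.

```java
import java.nio.ByteBuffer;

public class EqualsSketch {
    // Sketch of a byte[]-vs-ByteBuffer comparison that uses relative gets.
    // Each buf.get() mutates buf's position; the finally block restores it,
    // but a concurrent reader can observe the moved position in between.
    static boolean equals(byte[] a, ByteBuffer buf) {
        if (buf.remaining() != a.length) return false;
        int saved = buf.position();           // remember state
        try {
            for (byte b : a) {
                if (buf.get() != b) return false;  // relative get: mutates buf
            }
            return true;
        } finally {
            buf.position(saved);              // restore state afterwards
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[]{1, 2, 3});
        System.out.println(equals(new byte[]{1, 2, 3}, buf)); // true
        System.out.println(buf.remaining());                  // 3: restored
    }
}
```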



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839789#comment-13839789
 ] 

Lars Hofhansl commented on HBASE-10079:
---

Let's add some flushes to the Increment part of TestAtomicOperation. It should 
have found this issue.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839689#comment-13839689
 ] 

Jonathan Hsieh commented on HBASE-10079:


[~nkeywal] HBASE-9976 introduces the TableName cache which is the root cause.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839677#comment-13839677
 ] 

Jonathan Hsieh commented on HBASE-10079:


Removed HTablePool code and still got a race.

{code}
Exception in thread "Thread-1" java.lang.IllegalStateException: test was supposed to be in the cache
	at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
	at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:412)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
	at IncrementBlaster$1.run(IncrementBlaster.java:130)
{code}

The TableName cache is the root cause of the race.  The testing program has n 
threads which wait at a rendezvous point before creating independent HTable 
instances with the same name.  It is unreasonable for separate HTable 
constructors that just happen to open the same table to race like this.  The 
fix should be in the TableName cache.
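The test shape described above can be sketched as follows (openTable is a hypothetical stand-in for the HTable construction; the actual IncrementBlaster source isn't shown in this thread): n threads block on a barrier, then all resolve the same table name at once, maximizing the chance of hitting the cache race.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class RendezvousSketch {
    // Start n threads, hold them at a rendezvous point, then let them all
    // run openTable concurrently.
    public static void race(int n, Runnable openTable) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                try {
                    barrier.await();   // rendezvous: everyone starts together
                    openTable.run();   // all threads construct concurrently
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger opened = new AtomicInteger();
        race(80, opened::incrementAndGet);
        System.out.println(opened.get()); // 80
    }
}
```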



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839643#comment-13839643
 ] 

Jonathan Hsieh commented on HBASE-10079:


Hm.. HBASE-6580 deprecates HTablePool and happened when I wasn't paying 
attention.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839634#comment-13839634
 ] 

Jonathan Hsieh commented on HBASE-10079:


I'm having a hard time recreating the jagged counts.  I tried reverting 
patches, both before and after the patch nkeywal provided.  I think the flush 
problem was a red herring: I was biased by the customer problem I was recently 
working on.

When I changed my tests to do 10 increments, the pattern really jumped out.  
Looking at the original numbers from this morning, I see the same pattern 
with the 250k increments.

80 threads, 250k increments == 3125 increments / thread.
count = 246875 != 250000 (flush)  // one thread failed to start.
count = 243750 != 250000 (kill)  // two threads failed to start.
count = 246878 != 250000 (kill -9)  // one thread failed to start, and 3 
threads sent increments that succeeded and were retried but didn't get an ack 
because of the kill -9.

The last one threw us off because it wasn't a regular multiple, but I think 
the explanation makes sense.
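The thread accounting above can be checked with a few lines of arithmetic:

```java
public class Accounting {
    public static void main(String[] args) {
        int threads = 80, perThread = 3125;
        int expected = threads * perThread;
        System.out.println(expected);                        // 250000
        System.out.println((expected - 246875) / perThread); // 1 thread missing (flush)
        System.out.println((expected - 243750) / perThread); // 2 threads missing (kill)
        // kill -9: 246878 = 250000 - 3125 + 3, i.e. one thread missing plus
        // 3 increments double-applied by retries that never saw an ack.
        System.out.println(246878 - (expected - perThread)); // 3
    }
}
```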

I'm looking into whether my test code is bad (is there TableName documentation 
I ignored that says the race in the stack trace is my fault?) or whether we 
need to add some synchronization to this createTableNameIfNecessary method.
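For illustration, one hedged sketch of the kind of synchronization that would close the race: ConcurrentHashMap.computeIfAbsent creates at most one value per key and hands the same instance to every caller, so the check-then-act window in createTableNameIfNecessary disappears. (computeIfAbsent is Java 8+, so this would not have applied to 0.96 as-is; it only illustrates the invariant the cache needs.)

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class InternCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

    // computeIfAbsent runs the factory at most once per key; every caller,
    // including concurrent ones, gets back the single winning instance.
    public V intern(K key, Function<K, V> factory) {
        return cache.computeIfAbsent(key, factory);
    }

    public static void main(String[] args) {
        InternCache<String, Object> names = new InternCache<>();
        Object a = names.intern("default:test", k -> new Object());
        Object b = names.intern("default:test", k -> new Object());
        System.out.println(a == b); // true: both callers see one instance
    }
}
```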





[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839554#comment-13839554
 ] 

Jonathan Hsieh commented on HBASE-10079:


Here's the dropped "threads" stack dump -- each one of these corresponds to a 
thread that didn't run.

{code}
Exception in thread "Thread-58" java.lang.IllegalStateException: test was supposed to be in the cache
	at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:337)
	at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:385)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:165)
	at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:39)
	at org.apache.hadoop.hbase.client.HTablePool.createHTable(HTablePool.java:271)
	at org.apache.hadoop.hbase.client.HTablePool.findOrCreateTable(HTablePool.java:201)
	at org.apache.hadoop.hbase.client.HTablePool.getTable(HTablePool.java:180)
	at IncrementBlaster$1.run(IncrementBlaster.java:131)
{code}




[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839495#comment-13839495
 ] 

Sergey Shelukhin commented on HBASE-10079:
--

+1



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839491#comment-13839491
 ] 

stack commented on HBASE-10079:
---

Patch is good.  Nice work, Jon.  Makes sense that this missing lock was 
exposed by HBASE-9976.  Pity we didn't catch it in tests previously.  Any 
chance of a test?





[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839431#comment-13839431
 ] 

Hadoop QA commented on HBASE-10079:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12617058/10079.v1.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
    Please justify why no new tests are needed for this patch.
    Also please list what manual steps were performed to verify this patch.

    {color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 1.0 profile.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 2.0 profile.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 1 warning message.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines longer than 100.

    {color:red}-1 site{color}.  The patch appears to cause the mvn site goal to fail.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/8057//console

This message is automatically generated.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839351#comment-13839351
 ] 

Nicolas Liochon commented on HBASE-10079:
-

That's strange. We should lock, and we don't do it in trunk or 0.96 head...



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839334#comment-13839334
 ] 

Jonathan Hsieh commented on HBASE-10079:


Actually, the current tip of 0.96 (HBASE-9485) doesn't seem to have the flush 
problem but does have the htable initialization problem.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839333#comment-13839333
 ] 

Nicolas Liochon commented on HBASE-10079:
-

I guess the error is in HBASE-9963. It seems there is an issue in 
HStore#StoreFlusherImpl#prepare: there is no lock there.
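The failure mode a lock-free flush prepare invites can be shown deterministically. The sketch below is illustrative only: the class and field names (MiniMemstore, active, snapshot, disk) are invented for this example and are not HBase's real internals. It models a flush as a two-step "swap the active set into a snapshot, then persist and discard the snapshot"; a writer that captured a reference to the active set before the unlocked swap applies its edit to a set that has already been discarded, and the increment vanishes.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Deterministic illustration of the suspected race: a flush prepare that
// swaps the memstore without excluding in-flight writers can strand an edit.
// All names here are hypothetical, not HBase's actual classes.
public class FlushRaceSketch {
    static class MiniMemstore {
        ConcurrentSkipListMap<String, Long> active = new ConcurrentSkipListMap<>();
        ConcurrentSkipListMap<String, Long> snapshot = new ConcurrentSkipListMap<>();
        long disk = 0;  // stands in for the flushed store file contents

        // Flush step 1: swap the active set into the snapshot (no lock taken).
        void prepareFlush() {
            snapshot = active;
            active = new ConcurrentSkipListMap<>();
        }

        // Flush step 2: persist the snapshot, then discard it.
        void commitFlush() {
            disk += snapshot.getOrDefault("row", 0L);
            snapshot = new ConcurrentSkipListMap<>();
        }

        // Total visible count: persisted value plus whatever is in memory.
        long read() {
            return disk + active.getOrDefault("row", 0L)
                        + snapshot.getOrDefault("row", 0L);
        }
    }

    public static void main(String[] args) {
        MiniMemstore m = new MiniMemstore();
        m.active.put("row", 5L);

        // Writer thread captures the active set, then gets descheduled...
        ConcurrentSkipListMap<String, Long> stale = m.active;

        // ...meanwhile the flusher runs both steps:
        m.prepareFlush();
        m.commitFlush();   // disk = 5, snapshot discarded

        // Writer resumes and applies its increment to the stale reference:
        stale.merge("row", 1L, Long::sum);  // lands in the discarded set

        System.out.println(m.read());  // prints 5 -- the increment is lost
    }
}
```

With a lock (or any mechanism that drains in-flight writers before the swap), the writer either lands in the snapshot before it is persisted or in the new active set; either way the edit survives.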



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839291#comment-13839291
 ] 

Jonathan Hsieh commented on HBASE-10079:


Seems like reverting either HBASE-9963 or HBASE-10014 gets rid of the "jagged" 
losses due to flushes.  However, when testing on the tip of 0.96 with the 
reverts, I seem to be losing some threads as they initialize because of some 
sort of race.

I'm going to try from the exact point where 0.96.1rc1 was cut to see if it is 
in a happy place, and will investigate the htable initialization problem 
afterwards.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839233#comment-13839233
 ] 

Jonathan Hsieh commented on HBASE-10079:


I tweaked the test and wasn't able to duplicate at the unit test level.  I'm 
looking into reverting a few patches touching the memstore/flush area and 
testing on the cluster (HBASE-9963 and HBASE-10014 seem like candidates) to 
see if they caused the problem.





[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839153#comment-13839153
 ] 

Jonathan Hsieh commented on HBASE-10079:


TestHRegion#testParallelIncrementWithMemStoreFlush passes on the 0.96 tip.  The 
test actually waits for all the increments to be done before flushing (instead 
of flushing while other increments are happening), so my bet is that it doesn't 
actually test the race condition.
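The pattern such a test needs is a flusher running concurrently with the incrementers, not after them. Below is a hedged, self-contained sketch of that shape; the names and the long-plus-read/write-lock "memstore" are invented for illustration (loosely mirroring a region updates-lock), not HBase's real test harness. Because the flush swap here excludes in-flight writers, the final count is exact.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a test that flushes WHILE increments are in flight.
// All names are hypothetical; this is the shape of the test, not HBase code.
public class ConcurrentFlushTest {
    static final AtomicLong memstore = new AtomicLong();  // in-memory count
    static final AtomicLong flushed = new AtomicLong();   // "persisted" count
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    static void increment() {
        lock.readLock().lock();   // increments may run concurrently with each other
        try { memstore.incrementAndGet(); }
        finally { lock.readLock().unlock(); }
    }

    static void flush() {
        lock.writeLock().lock();  // excludes all in-flight increments
        try { flushed.addAndGet(memstore.getAndSet(0)); }
        finally { lock.writeLock().unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        final int threads = 4, perThread = 25_000;
        Thread[] writers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) increment();
            });
            writers[i].start();
        }
        // Flusher runs concurrently with the writers -- this is the key
        // difference from a test that flushes only after the writers join.
        Thread flusher = new Thread(() -> {
            for (int j = 0; j < 50; j++) flush();
        });
        flusher.start();

        for (Thread w : writers) w.join();
        flusher.join();
        flush();  // final flush to drain the memstore

        System.out.println(flushed.get());  // 100000 when nothing is lost
    }
}
```

If the write lock in flush() is removed and the swap is made non-atomic, the same harness starts reporting the jagged undercounts described in this issue.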



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839114#comment-13839114
 ] 

Jonathan Hsieh commented on HBASE-10079:


This was the issue that fixed the problem in the 0.94 and 0.95 branches (at 
the time).  It added a test to TestHRegion called 
testParallelIncrementWithMemStoreFlush that tests the situation.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839103#comment-13839103
 ] 

Jonathan Hsieh commented on HBASE-10079:


Do you mean the increval rig with the github link in the first comment? No, 
that was a quick and dirty program to duplicate a customer issue.

I'm in the process of adding flushes to the TestAtomicOperation unit tests that 
[~lhofhansl] mentioned on the mailing list.  I'll be able to bisect to find the 
bug that way.



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839102#comment-13839102
 ] 

Sergey Shelukhin commented on HBASE-10079:
--

hbase.regionserver.nonces.enabled is the server config setting. Although, 
during replay, the updates are never blocked if nonces collide. 



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839099#comment-13839099
 ] 

Sergey Shelukhin commented on HBASE-10079:
--

Does the writer check for exceptions? Can you try disabling nonces, to see if 
there could be collisions (although I would expect the client to receive 
exceptions in such cases)?



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839085#comment-13839085
 ] 

Jonathan Hsieh commented on HBASE-10079:


In 0.96.0:
* flush: Not able to reproduce data loss.
* with kill: Not able to reproduce data loss; had an overcount of 1.
* with kill -9: Not able to reproduce data loss; had an overcount of 1.

The overcount of 1 is likely a different bug that I think I'll let slide.  
Likely the client thought it failed and retried, but the first attempt actually 
made it to the log, and increments are not idempotent.

So the bug is somewhere between 0.96.0 and 0.96.1rc1.
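The retry-overcount hypothesis above is the classic non-idempotency problem: if an increment is applied server-side but the acknowledgement is lost, a blind client retry applies it twice. A toy illustration (all names hypothetical, the "lost ack" simulated):

```java
// Toy model of the overcount-by-one hypothesis: an increment that is applied
// but whose ack is lost, followed by a client retry. Names are invented.
public class RetryOvercount {
    static long counter = 0;

    // Server applies the increment, but the ack never reaches the client.
    static boolean incrementButDropAck() {
        counter++;
        return false;  // client observes a timeout/failure
    }

    static boolean increment() {
        counter++;
        return true;
    }

    public static void main(String[] args) {
        boolean ok = incrementButDropAck();  // applied server-side, ack lost
        if (!ok) increment();                // client retries -> applied twice
        System.out.println(counter);         // prints 2, client asked for 1
    }
}
```

Idempotent operations (e.g. a put of an absolute value) tolerate this retry; increments need something like the nonce mechanism discussed above to suppress the duplicate.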



[jira] [Commented] (HBASE-10079) Increments lost after flush

2013-12-04 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839042#comment-13839042
 ] 

Jonathan Hsieh commented on HBASE-10079:


Here's a link to the test programs I used to pull out this bug.  It needs to 
be polished and turned into an IT test as well as a perf test, probably in a 
separate issue.
https://github.com/jmhsieh/hbase/tree/increval
