[ 
https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839827#comment-13839827
 ] 

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

{code}
  /**
   * Check that the object does not exist already. There are two reasons for creating the objects
   * only once:
   * 1) With 100K regions, the table names take ~20MB.
   * 2) Equals becomes much faster as it's resolved with a reference and an int comparison.
   */
01  private static TableName createTableNameIfNecessary(ByteBuffer bns, ByteBuffer qns) {
02    for (TableName tn : tableCache) {
03      if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
04        return tn;
05      }
06    }
07
08    TableName newTable = new TableName(bns, qns);
09    if (tableCache.add(newTable)) {  // Adds the specified element if it is not already present
10      return newTable;
11    } else {
12      // Someone else added it. Let's find it.
13      for (TableName tn : tableCache) {
14        if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
15          return tn;
16        }
17      }
18    }
19
20    throw new IllegalStateException(newTable + " was supposed to be in the cache");
21  }
{code}

Here's the race:

We have two concurrent calls to createTableNameIfNecessary with the same 
namespace (which gets wrapped and becomes bns) and table qualifier (which gets 
wrapped and becomes qns). In my rig's case, ns=default and tn=test.

Thread one executes to line 08.  bns and qns are consumed by the gets in the 
TableName(BB,BB) constructor.
Thread two executes to line 08.  bns and qns are consumed by the gets in the 
TableName(BB,BB) constructor.
Thread two's add returns true at line 09, and it returns newTable at line 10.
Thread one's add returns false at line 09 since Thread two's TableName made it 
in.  It jumps to the else branch and continues executing at line 12.
At line 14, Thread one's first Bytes.equals call compares the byte[] from 
tn.getQualifier() against qns (which is a consumed BB, and thus has no more 
data on get).  This essentially always fails.
Thread one loops through, falls out, and then throws the IllegalStateException.
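
To make the "consumed BB" step concrete, here is a minimal standalone sketch in plain java.nio, with no HBase classes. The helpers extract and sameBytes are hypothetical stand-ins for the constructor's gets and for the Bytes.equals(byte[], ByteBuffer) comparison. They are assumptions for illustration, not the actual HBase code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ConsumedBufferDemo {
  // Hypothetical stand-in for the extraction in the TableName(BB,BB)
  // constructor: a relative bulk get advances the buffer's position.
  static byte[] extract(ByteBuffer bb) {
    byte[] out = new byte[bb.remaining()];
    bb.get(out);
    return out;
  }

  // Hypothetical stand-in for the byte[] vs. ByteBuffer comparison:
  // ByteBuffer.equals only looks at the *remaining* bytes of each buffer.
  static boolean sameBytes(byte[] a, ByteBuffer b) {
    return ByteBuffer.wrap(a).equals(b);
  }

  public static void main(String[] args) {
    byte[] cachedQualifier = "test".getBytes(StandardCharsets.UTF_8);
    ByteBuffer qns = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));

    // Like the first loop (line 03): the buffer has not been read yet.
    System.out.println("before extract: " + sameBytes(cachedQualifier, qns));

    // Like line 08: the constructor reads the buffer to the end.
    extract(qns);

    // Like the second loop (line 14): nothing is left to compare against.
    System.out.println("remaining:      " + qns.remaining());
    System.out.println("after extract:  " + sameBytes(cachedQualifier, qns));
  }
}
```

The comparison matches before the extraction and stops matching after it, even though the underlying bytes never changed. Only the buffer's position moved.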

So anytime we get to line 14, we'll fail.  

Solution is to make sure the constructor dups bns and qns before extracting the 
byte[]s.  Patch coming.
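
A rough sketch of the dup idea, again in plain java.nio rather than the actual patch: the hypothetical helper extractWithoutConsuming reads through ByteBuffer.duplicate(), which shares the content but has its own position and limit, so the caller's buffer is left untouched. The class and method names are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class DuplicateBeforeExtract {
  // Hypothetical helper: copy out the remaining bytes via a duplicate so the
  // caller's position is not advanced.
  static byte[] extractWithoutConsuming(ByteBuffer bb) {
    ByteBuffer dup = bb.duplicate(); // shared content, independent position/limit
    byte[] out = new byte[dup.remaining()];
    dup.get(out);                    // advances dup only
    return out;
  }

  public static void main(String[] args) {
    ByteBuffer qns = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));

    byte[] qualifier = extractWithoutConsuming(qns);

    // The original buffer still has all of its bytes, so a later comparison
    // against it (as in the second loop at line 14) can still succeed.
    System.out.println("extracted: " + new String(qualifier, StandardCharsets.UTF_8));
    System.out.println("remaining: " + qns.remaining());
    System.out.println("still equal: " + ByteBuffer.wrap(qualifier).equals(qns));
  }
}
```

With this approach the caller's bns and qns stay readable after the constructor runs, which is what the retry loop relies on.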



> Increments lost after flush 
> ----------------------------
>
>                 Key: HBASE-10079
>                 URL: https://issues.apache.org/jira/browse/HBASE-10079
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.1
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.1, 0.99.0
>
>         Attachments: 10079.v1.patch
>
>
> Testing 0.96.1rc1.
> With one process incrementing a row in a table, we increment a single col.  We 
> flush or do kills/kill -9 and data is lost.  Flush and kill are likely the 
> same problem (kill would flush); kill -9 may or may not have the same root 
> cause.
> 5 nodes
> hadoop 2.1.0 (a pre cdh5b1 hdfs).
> hbase 0.96.1 rc1 
> Test: 250000 increments on a single row and single col with various numbers of 
> client threads (IncrementBlaster).  Verify we have a count of 250000 after 
> the run (IncrementVerifier).
> Run 1: No fault injection.  5 runs.  count = 250000 on multiple runs.  
> Correctness verified.  1638 inc/s throughput.
> Run 2: flushes table with incrementing row.  count = 246875 !=250000.  
> correctness failed.  1517 inc/s throughput.  
> Run 3: kill of rs hosting incremented row.  count = 243750 != 250000. 
> Correctness failed.   1451 inc/s throughput.
> Run 4: one kill -9 of rs hosting incremented row.  count = 246878 != 250000.  
> Correctness failed. 1395 inc/s throughput (including recovery).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
