[ https://issues.apache.org/jira/browse/HBASE-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-18074.
---------------------------
    Resolution: Invalid

Resolving as invalid. Misreading on my part: we do aggregate back on the 
RowLockContext when a batch is made up of many mutations all on the same 
row -- they all get the same lock instance.
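
Below is a minimal, self-contained sketch of that aggregation idea -- not 
HRegion's actual RowLockContext code; the class, field, and method names are 
illustrative only. The point is that repeated acquisitions for the same row 
hand back the same context, and therefore the same lock instance:

{code}
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative registry: one context per row key, shared by every mutation
// in a batch that targets that row.
class RowLockRegistry {
  private final ConcurrentHashMap<String, RowLockContext> lockedRows =
      new ConcurrentHashMap<>();

  RowLockContext getContext(byte[] row) {
    String key = Arrays.toString(row);
    // computeIfAbsent returns the existing context when the row is already
    // present, so a batch of N mutations on one row shares a single lock.
    return lockedRows.computeIfAbsent(key, k -> new RowLockContext());
  }

  static class RowLockContext {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
  }
}
{code}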

> HBASE-12751 dropped optimization in doMiniBatch; we take lock per mutation rather than one per batch
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18074
>                 URL: https://issues.apache.org/jira/browse/HBASE-18074
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance
>            Reporter: stack
>            Assignee: stack
>
> HBASE-12751 did this:
> {code}
> ...
>          // If we haven't got any rows in our batch, we should block to
>          // get the next one.
> -        boolean shouldBlock = numReadyToWrite == 0;
>          RowLock rowLock = null;
>          try {
> -          rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
> +          rowLock = getRowLock(mutation.getRow(), true);
>          } catch (IOException ioe) {
>            LOG.warn("Failed getting lock in batch put, row="
>              + Bytes.toStringBinary(mutation.getRow()), ioe);
>          }
>          if (rowLock == null) {
>            // We failed to grab another lock
> ..
> {code}
> In the old codebase, passing true to getRowLock meant do not wait on the row 
> lock. In the HBASE-12751 codebase, the flag means read vs. write lock. So we 
> take a read lock on every mutation in the batch; with ten mutations per batch 
> on average, that is 10x the number of locks.
> I'm in here because of an interesting case where increments and batch 
> mutations against the same row seem to back up and stall trying to get locks. 
> It looks like this, where every handler is in one of the two states below:
> {code}
> "RpcServer.FifoWFPBQ.default.handler=190,queue=10,port=60020" #243 daemon 
> prio=5 os_prio=0 tid=0x00007fbb58691800 nid=0x2d2527 waiting on condition 
> [0x00007fbb4ca49000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00000007c6001b38> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7453)
> ...
> {code}
> {code}
> "RpcServer.FifoWFPBQ.default.handler=180,queue=0,port=60020" #233 daemon 
> prio=5 os_prio=0 tid=0x00007fbb586ed800 nid=0x2d251d waiting on condition 
> [0x00007fbb4d453000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0000000354976c00> (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3017)
> ...
> {code}
> It gets so bad it looks like a deadlock, but if you give it a while we move on 
> (I put it down to the safepoint giving a misleading view of what is happening). 
> Let me put back the optimization.
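
For the "10x the locks" arithmetic in the quoted description above, here is a 
hypothetical, stand-alone sketch (not HBase code; names are illustrative) 
counting distinct lock instances for a ten-mutation, single-row batch under the 
two readings:

{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchLockCount {
  public static void main(String[] args) {
    // A batch of ten mutations, all targeting the same row.
    List<String> batchRows = Collections.nCopies(10, "row-1");

    // Reading the diff as "a separate lock per mutation": ten lock instances.
    int perMutationInstances = batchRows.size();

    // Aggregating on one shared per-row context (the resolution's point):
    // one lock instance per distinct row, so one here.
    Set<String> distinctRows = new HashSet<>(batchRows);
    int perRowInstances = distinctRows.size();

    System.out.println("per-mutation lock instances: " + perMutationInstances); // 10
    System.out.println("per-row lock instances:      " + perRowInstances);      // 1
  }
}
{code}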



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
