[ https://issues.apache.org/jira/browse/HBASE-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-18074.
---------------------------
    Resolution: Invalid

Resolving as invalid. Misreading on my part; we should be aggregating back on the RowLockContext when a batch is made up of many mutations all on the same row -- they should all get the same lock instance.
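The aggregation works roughly like the sketch below: a registry keyed by row hands every caller for the same row the same lock instance. This is illustrative only -- the names (RowLockRegistry, lockFor) and the map-based scheme are made up, not HBase's actual RowLockContext implementation.

{code}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only -- not HBase's RowLockContext. The point: every
// mutation on the same row resolves to one shared lock instance.
public class RowLockRegistry {
    private final ConcurrentHashMap<ByteBuffer, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    ReentrantReadWriteLock lockFor(byte[] row) {
        // ByteBuffer keys compare by content, so repeat callers carrying
        // the same row bytes get the same lock instance back.
        return locks.computeIfAbsent(ByteBuffer.wrap(row),
                k -> new ReentrantReadWriteLock(true));
    }

    public static void main(String[] args) {
        RowLockRegistry registry = new RowLockRegistry();
        byte[] row = "row-1".getBytes(StandardCharsets.UTF_8);
        // Two "mutations" with equal row keys share one lock instance.
        System.out.println(registry.lockFor(row) == registry.lockFor(row.clone())); // true
    }
}
{code}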
> HBASE-12751 dropped optimization in doMiniBatch; we take lock per mutation rather than one per batch
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18074
>                 URL: https://issues.apache.org/jira/browse/HBASE-18074
>             Project: HBase
>          Issue Type: Bug
>          Components: Performance
>            Reporter: stack
>            Assignee: stack
>
> HBASE-12751 did this:
> {code}
> ...
>      // If we haven't got any rows in our batch, we should block to
>      // get the next one.
> -    boolean shouldBlock = numReadyToWrite == 0;
>      RowLock rowLock = null;
>      try {
> -      rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
> +      rowLock = getRowLock(mutation.getRow(), true);
>      } catch (IOException ioe) {
>        LOG.warn("Failed getting lock in batch put, row="
>          + Bytes.toStringBinary(mutation.getRow()), ioe);
>      }
>      if (rowLock == null) {
>        // We failed to grab another lock
> ..
> {code}
> In the old codebase, passing true to getRowLock meant do-not-wait on the row lock. In the HBASE-12751 codebase, the flag means read/write, so we take a read lock for every mutation in the batch. With ten mutations per batch on average, that is 10x the number of lock acquisitions.
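To make the per-mutation locking concrete, here is a toy, self-contained illustration (class name and batch size are made up; this is not the HRegion code path): under the new semantics each mutation in a batch takes its own read lock on the row's fair ReentrantReadWriteLock.

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PerMutationLocking {
    public static void main(String[] args) {
        // Fair lock, matching the ReentrantReadWriteLock$FairSync seen in
        // the handler stack traces below.
        ReentrantReadWriteLock rowLock = new ReentrantReadWriteLock(true);

        int mutationsInBatch = 10; // made-up batch size
        // New semantics: one read-lock acquisition per mutation, so a
        // ten-mutation batch does ten acquisitions against the same row.
        for (int i = 0; i < mutationsInBatch; i++) {
            rowLock.readLock().lock();
        }
        System.out.println("read holds: " + rowLock.getReadLockCount()); // prints 10
        for (int i = 0; i < mutationsInBatch; i++) {
            rowLock.readLock().unlock();
        }
    }
}
{code}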
> I'm in here because of an interesting case where increments and batches going into the same row seem to back up and stall trying to get locks. It looks like this, where every handler is in one of the two states below:
> {code}
> "RpcServer.FifoWFPBQ.default.handler=190,queue=10,port=60020" #243 daemon prio=5 os_prio=0 tid=0x00007fbb58691800 nid=0x2d2527 waiting on condition [0x00007fbb4ca49000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000007c6001b38> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
>         at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doIncrement(HRegion.java:7453)
> ...
> {code}
> {code}
> "RpcServer.FifoWFPBQ.default.handler=180,queue=0,port=60020" #233 daemon prio=5 os_prio=0 tid=0x00007fbb586ed800 nid=0x2d251d waiting on condition [0x00007fbb4d453000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x0000000354976c00> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
>         at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5171)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3017)
> ...
> {code}
> It gets so bad it looks like deadlock, but if you give it a while we move on (I put it down to the safepoint giving a misleading view of what is happening).
> Let me put back the optimization.
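For what it's worth, the stall pattern in those traces can be reproduced outside HBase with a fair ReentrantReadWriteLock: under fair ordering, readers queue behind a waiting writer and both sides park in timed tryLock, yet everyone eventually makes progress -- heavy contention, not deadlock. A self-contained sketch (thread names, counts, and timings are made up):

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairLockStall {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair

        // "Increment" thread: wants the write lock, like doIncrement above.
        Thread writer = new Thread(() -> {
            try {
                while (!lock.writeLock().tryLock(10, TimeUnit.MILLISECONDS)) {
                    // Parks in timed tryLock, as in the first trace.
                }
                try { Thread.sleep(50); } finally { lock.writeLock().unlock(); }
                System.out.println("writer done");
            } catch (InterruptedException ignored) { }
        });

        // "Batch" threads: each takes read locks, like doMiniBatchMutation.
        Runnable reader = () -> {
            try {
                for (int i = 0; i < 100; i++) {
                    while (!lock.readLock().tryLock(10, TimeUnit.MILLISECONDS)) {
                        // Readers queue behind the waiting writer in fair mode.
                    }
                    try { Thread.sleep(1); } finally { lock.readLock().unlock(); }
                }
                System.out.println(Thread.currentThread().getName() + " done");
            } catch (InterruptedException ignored) { }
        };

        Thread r1 = new Thread(reader, "batch-1");
        Thread r2 = new Thread(reader, "batch-2");
        r1.start(); r2.start(); writer.start();
        r1.join(); r2.join(); writer.join();
        // No deadlock: all threads complete, just slowly under contention.
    }
}
{code}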