[jira] [Comment Edited] (PHOENIX-3111) Possible Deadlock/delay while building index, upsert select, delete rows at server
[ https://issues.apache.org/jira/browse/PHOENIX-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389608#comment-15389608 ]

James Taylor edited comment on PHOENIX-3111 at 7/23/16 12:49 AM:
-----------------------------------------------------------------

Just want to make sure I understand the scenario under which this can occur. Does the problem only occur when we're reading and writing to the same table through a coprocessor? Aren't splits already blocked because we're holding a region lock through the region.startRegionOperation() call? FWIW, both the DELETE and UPSERT SELECT cases are optimizations which we could disable, executing them on the client if need be. The extra cost would be the data transfer between the client and server (which is probably pretty negligible for the DELETE case). Is it feasible to block splits in these cases, and if so, what would the implications be? Is this a theoretical problem, or have you actually seen it happen and can you repro it?

> Possible Deadlock/delay while building index, upsert select, delete rows at server
> -----------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3111
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3111
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Sergio Peleato
>            Assignee: Rajeshbabu Chintaguntla
>            Priority: Blocker
>             Fix For: 4.8.0
>
> There is a possible deadlock, or at least a long delay, while building a local index or running UPSERT SELECT or DELETE at the server. The situation can happen as follows.
> In these queries we scan rows from a table and write mutations back to the same table. The memstore can then reach the blocking memstore size threshold, at which point a RegionTooBusyException may be thrown back to the client and the queries may retry scanning.
> Take the local index build case: we first scan from the data table, prepare index mutations, and write them back to the same table, so the memstore can fill up, in which case we try to flush the region. But if a split happens in between, the split waits for the region's write lock in order to close it, and the flush waits for a read lock behind that queued write lock until the local index build completes. The local index build won't complete because we are not allowed to write until there is a flush. This may not be a complete deadlock, but the queries can take a very long time to complete in these cases.
> {noformat}
> "regionserver//192.168.0.53:16201-splits-1469165876186" #269 prio=5 os_prio=31 tid=0x7f7fb2050800 nid=0x1c033 waiting on condition [0x000139b68000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x0006ede72550> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1422)
>         at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1370)
>         - locked <0x0006ede69d00> (a java.lang.Object)
>         at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.stepsBeforePONR(SplitTransactionImpl.java:394)
>         at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.createDaughters(SplitTransactionImpl.java:278)
>         at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(SplitTransactionImpl.java:561)
>         at org.apache.hadoop.hbase.regionserver.SplitRequest.doSplitting(SplitRequest.java:82)
>         at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:154)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>    Locked ownable synchronizers:
>         - <0x0006ee132098> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> {noformat}
> "MemStoreFlusher.0" #170 prio=5 os_prio=31 tid=0x7f7fb6842000 nid=0x19303 waiting on condition [0x0001388e9000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
> {noformat}
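To make the lock interaction concrete, here is a minimal, hypothetical sketch of the read-modify-write pattern a coprocessor like this follows. Illustrative names such as buildIndexMutations() are assumptions, not Phoenix's actual UngroupedAggregateRegionObserver code; the Region calls themselves are the standard HBase 1.x API.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.regionserver.Region;
import org.apache.hadoop.hbase.regionserver.RegionScanner;

public class ReadWriteSameRegionSketch {
    void scanAndWriteBack(Region region, Scan scan) throws IOException {
        // Takes the region's read lock; a pending split's close() queues a
        // write-lock request behind us, and a flush then queues behind that.
        region.startRegionOperation();
        try (RegionScanner scanner = region.getScanner(scan)) {
            List<Cell> cells = new ArrayList<>();
            boolean more;
            do {
                cells.clear();
                more = scanner.next(cells);
                Mutation[] batch = buildIndexMutations(cells); // illustrative
                // Writing back to the same region: if the memstore is over the
                // blocking threshold, this write stalls until a flush happens,
                // but the flush is queued behind the split's write-lock request.
                region.batchMutate(batch);
            } while (more);
        } finally {
            region.closeRegionOperation();
        }
    }

    private Mutation[] buildIndexMutations(List<Cell> cells) {
        return new Mutation[0]; // placeholder for illustration
    }
}
{code}

The sketch's scanner holds the region read lock the whole time; once a split's close() queues its write-lock request, the flush the write is waiting on queues behind it, which is the wait cycle shown in the stack traces above.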
[jira] [Comment Edited] (PHOENIX-3111) Possible Deadlock/delay while building index, upsert select, delete rows at server
[ https://issues.apache.org/jira/browse/PHOENIX-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398303#comment-15398303 ]

James Taylor edited comment on PHOENIX-3111 at 7/28/16 10:17 PM:
-----------------------------------------------------------------

Thanks again for the patch, [~rajeshbabu] & [~sergey.soldatov]. Here are some comments/questions:
- What sets {{isRegionClosing}} back to false if it's set to true in preClose()? Since the region is closing, do we always get a new instance of the UngroupedAggregateRegionObserver, so that when it opens again it will be initialized to false? Or should we also explicitly set {{isRegionClosing}} to false in pre- or postOpen()? Is there any chance it could get stuck in a true state (i.e. can preClose() complete and then the close not actually happen)?
- We need more comments to explain this (even the mighty Lars is having a hard time - imagine an HBase novice like me trying to understand it). Can we have Sergey's very nice comment above as javadoc around the declaration of the lock object? Plus, add to it a (5), (6), (7)... that describes at a high level the approach taken to solve this.
- How about adding an {{@GuardedBy("lock")}} to the {{scansReferenceCount}} declaration?
- What does {{blockingMemStoreSize}} represent exactly? Is it an upper bound on how many bytes can accumulate before the memstore is full? Add a comment, please.
{code}
+        final long blockingMemStoreSize = flushSize * (
+                conf.getLong(HConstants.HREGION_MEMSTORE_BLOCK_MULTIPLIER,
+                        HConstants.DEFAULT_HREGION_MEMSTORE_BLOCK_MULTIPLIER) - 1);
{code}
- Then this code will basically delay writing to the memstore while we're over this threshold, for up to 3 seconds. Is this to give the flush a chance to happen (since that's what will cause the memstore size to decrease)? Would be good to document this more. Will this throttling occur in the normal course of things, and how significant a delay is the three seconds? In the non local index case, we take this code path purely as an optimization - would it be better to just not do this optimization and let the data go back to the client and be pushed back to the server?
{code}
+        // We are waiting 3 seconds for the memstore flush to happen
+        for (int i = 0; region.getMemstoreSize() > blockingMemstoreSize && i < 30; i++) {
+            try {
+                checkForRegionClosing();
+                Thread.sleep(100);
+            } catch (InterruptedException e) {
+                throw new IOException(e);
+            }
+        }
+        checkForRegionClosing();
{code}
- Under what condition would the region close during the commitBatch call (since we're already blocking splits)? Is it if the region gets reassigned for some reason? Can we document this there?
- Should we be blocking a merge too? Or are we not so worried about merges because they're not as common and are user initiated?
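To make the threshold arithmetic and the 3-second wait concrete, here is a small self-contained example with assumed values (a 128 MB flush size and the default block multiplier of 4; these numbers are illustrative, not taken from the patch):

{code}
// Illustrative numbers only; not part of the patch.
public class BlockingMemStoreSizeExample {
    public static void main(String[] args) {
        final long flushSize = 128L * 1024 * 1024; // hbase.hregion.memstore.flush.size
        final long blockMultiplier = 4;            // hbase.hregion.memstore.block.multiplier

        // HBase itself blocks updates (RegionTooBusyException) once the memstore
        // exceeds flushSize * multiplier.
        long hbaseBlockingSize = flushSize * blockMultiplier;          // 512 MB
        // The quoted patch code throttles one flush-size earlier, presumably to
        // leave headroom before HBase starts rejecting writes outright.
        long blockingMemStoreSize = flushSize * (blockMultiplier - 1); // 384 MB

        System.out.printf("HBase blocks at %d MB; patch throttles at %d MB%n",
                hbaseBlockingSize >> 20, blockingMemStoreSize >> 20);

        // The quoted wait loop polls up to 30 times, sleeping 100 ms each pass:
        long maxWaitMs = 30 * 100;
        System.out.println("Max throttle delay per batch: " + maxWaitMs + " ms"); // 3000 ms
    }
}
{code}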
[jira] [Comment Edited] (PHOENIX-3111) Possible Deadlock/delay while building index, upsert select, delete rows at server
[ https://issues.apache.org/jira/browse/PHOENIX-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398720#comment-15398720 ]

Samarth Jain edited comment on PHOENIX-3111 at 7/29/16 5:23 AM:
----------------------------------------------------------------

[~rajeshbabu] - my question is an extension of what [~jamestaylor] asked. Do the split, bulk load, or other similar processes create their own instances of UngroupedAggregateRegionObserver?
{code}
@Override
public void preSplit(ObserverContext<RegionCoprocessorEnvironment> c, byte[] splitRow)
        throws IOException {
    // Don't allow splitting if operations that need to read and write to the same
    // region are going on in the coprocessors, to avoid the deadlock scenario.
    // See PHOENIX-3111.
    synchronized (lock) {
        if (scansReferenceCount != 0) {
            throw new IOException("Operations like local index building/delete/upsert select"
                    + " might be going on so not allowing to split.");
        }
    }
}

@Override
public void preBulkLoadHFile(ObserverContext<RegionCoprocessorEnvironment> c,
        List<Pair<byte[], String>> familyPaths) throws IOException {
    // Don't allow bulkload if operations that need to read and write to the same
    // region are going on in the coprocessors, to avoid the deadlock scenario.
    // See PHOENIX-3111.
    synchronized (lock) {
        if (scansReferenceCount != 0) {
            throw new DoNotRetryIOException("Operations like local index building/delete/upsert select"
                    + " might be going on so not allowing to bulkload.");
        }
    }
}

@Override
public void preClose(ObserverContext<RegionCoprocessorEnvironment> c, boolean abortRequested)
        throws IOException {
    synchronized (lock) {
        while (scansReferenceCount != 0) {
            isRegionClosing = true;
            try {
                lock.wait(1000);
            } catch (InterruptedException e) {
            }
        }
    }
}
{code}
If yes, then wouldn't they just see the default values of the member variables isRegionClosing, scansReferenceCount, etc.? And if so, how is the whole synchronization going to work? I *think* you are probably going to need a singleton to track state at the region server level. See StatisticsCollectionRunTracker as an example, which tracks state for every region undergoing stats collection via compaction or the UPDATE STATISTICS command. Also worth looking at is how it stops an UPDATE STATISTICS command from doing stats collection if a major compaction has been kicked off.
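A minimal sketch of the region-server-level singleton being suggested here, modeled loosely on the StatisticsCollectionRunTracker pattern (the class and method names are hypothetical, not actual Phoenix code):

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// One instance per region server process, so every coprocessor instance
// (however many the split/bulk-load/close paths create) sees the same state.
public final class ServerScanTracker {
    private static final ServerScanTracker INSTANCE = new ServerScanTracker();

    // Keyed by encoded region name; counts in-flight read-and-write-back scans.
    private final ConcurrentMap<String, AtomicInteger> scanCounts =
            new ConcurrentHashMap<>();

    private ServerScanTracker() {}

    public static ServerScanTracker getInstance() {
        return INSTANCE;
    }

    public void scanStarted(String encodedRegionName) {
        scanCounts.computeIfAbsent(encodedRegionName, k -> new AtomicInteger())
                  .incrementAndGet();
    }

    public void scanFinished(String encodedRegionName) {
        AtomicInteger count = scanCounts.get(encodedRegionName);
        if (count != null && count.decrementAndGet() <= 0) {
            scanCounts.remove(encodedRegionName, count);
        }
    }

    // preSplit/preBulkLoadHFile/preClose would consult this instead of a
    // per-coprocessor-instance field like scansReferenceCount.
    public boolean hasActiveScans(String encodedRegionName) {
        AtomicInteger count = scanCounts.get(encodedRegionName);
        return count != null && count.get() > 0;
    }
}
{code}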
[jira] [Comment Edited] (PHOENIX-3111) Possible Deadlock/delay while building index, upsert select, delete rows at server
[ https://issues.apache.org/jira/browse/PHOENIX-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165869#comment-16165869 ]

Rajeshbabu Chintaguntla edited comment on PHOENIX-3111 at 9/14/17 7:29 AM:
---------------------------------------------------------------------------

[~jamestaylor] Can you please provide some details, like what operations were going on at that moment - any upsert selects to the same table, deletes, or index rebuilding? On the HBase side, what we can do is retry from the HBase client whenever a NotServingRegionException is thrown from coprocessors, but that's not happening because it gets wrapped in a different exception (forgot the exact one :() and the client does not retry after that.
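To illustrate the client-side gap being described, here is a hedged sketch of recognizing a wrapped NotServingRegionException by walking the cause chain (a hypothetical helper; the actual wrapping exception in this report was not identified):

{code}
import org.apache.hadoop.hbase.NotServingRegionException;

public final class RetryCheck {
    private RetryCheck() {}

    // Returns true if a NotServingRegionException appears anywhere in the cause
    // chain, so a caller could treat the failure as retriable even when the
    // coprocessor wrapped it in some other exception type.
    public static boolean isWrappedNotServingRegion(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof NotServingRegionException) {
                return true;
            }
        }
        return false;
    }
}
{code}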
[jira] [Comment Edited] (PHOENIX-3111) Possible Deadlock/delay while building index, upsert select, delete rows at server
[ https://issues.apache.org/jira/browse/PHOENIX-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166604#comment-16166604 ]

Andrew Purtell edited comment on PHOENIX-3111 at 9/14/17 5:06 PM:
------------------------------------------------------------------

bq. From HBase side we need to revert HBASE-14893 and handle NotServingRegionException thrown in coprocessors properly

This needs very careful thought and extensive testing before any changes should be committed. Any change to the locking discipline around region flushing/closing/splitting can be the road to data loss. Look at what happened with the storefile locking changes that went into 1.3, for example. We should never have done it. Once burned, twice shy.

I wonder if we should extend the use of Phoenix's new LockManager to other cases, so we avoid tinkering with complex internal locking state and semantics in HBase. Locking disciplines are not part of HBase's compatibility guarantees. They will vary between code lines and might even change between point releases. This doesn't necessarily help where internal HBase locking is causing issues, but we are going to have an extremely high bar for admitting changes to locking going forward. [~lhofhansl] [~stack] [~mantonov] [~ghelmling]
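As one illustration of the direction being suggested - coordinating Phoenix's own server-side operations with an application-level lock rather than HBase's internal region locks - here is a minimal, hypothetical sketch (this is not the API of Phoenix's actual LockManager):

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only: a coprocessor-level read/write lock that
// Phoenix controls itself, so nothing here depends on HBase's internal
// locking discipline (which, as noted above, is not a compatibility guarantee).
public class CoprocessorOperationLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Server-side scans that write back to the region take the shared side.
    public void beginScan() {
        lock.readLock().lock();
    }

    public void endScan() {
        lock.readLock().unlock();
    }

    // preSplit/preClose would try the exclusive side with a timeout rather
    // than queueing indefinitely behind in-flight scans.
    public boolean tryBlockForRegionChange(long timeout, TimeUnit unit)
            throws InterruptedException {
        return lock.writeLock().tryLock(timeout, unit);
    }

    public void unblockAfterRegionChange() {
        lock.writeLock().unlock();
    }
}
{code}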