[ 
https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219984#comment-17219984
 ] 

ASF GitHub Bot commented on GEODE-8652:
---------------------------------------

Bill opened a new pull request #5666:
URL: https://github.com/apache/geode/pull/5666


   - Allow NioSslEngine.close() to proceed even if readers (or writers) are
     operating on its ByteBuffers, so that Connection.close() can close its
     socket and proceed.
   
   Co-authored-by: Bill Burcham <bill.burc...@gmail.com>
   Co-authored-by: Darrel Schneider <dschnei...@pivotal.io>
   Co-authored-by: Ernie Burghardt <burghar...@vmware.com>
   
   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Once the PR is submitted, please check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please
   send an email to d...@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> member hung in Connection.notifyHandshakeWaiter() during disconnect waiting 
> for a lock held by another thread in Connection.readAck() 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8652
>                 URL: https://issues.apache.org/jira/browse/GEODE-8652
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.14.0
>            Reporter: Bill Burcham
>            Priority: Major
>
> An application encountered the following hang in a TLS-enabled cluster.
> Let's call the cluster members ds3 -> ds1. 
> ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in 
> {{SocketChannel.read()}} waiting for the acknowledgement.
> {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut 
> the member down. That thread blocks trying to acquire a lock on the 
> {{NioSslEngine}} held by the first thread (the one waiting for the ack 
> to the put-all).
> Somehow the shutdown thread must be allowed to proceed.
> Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) 
> trying to shut down the member but it's stuck waiting for the monitor on the 
> {{NioSslEngine}}:
> {noformat}
> "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 tid=0x00007fdb9c011000 nid=0x2fcc waiting for monitor entry [0x00007fdb6f4b7000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804)
>       - waiting to lock <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine)
>       at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350)
>       at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
>       at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
>       at org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661)
>       - locked <0x00000000f2678cf8> (a java.util.ArrayList)
>       - locked <0x00000000f1187348> (a java.util.concurrent.ConcurrentHashMap)
>       at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487)
>       at org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644)
>       - locked <0x00000000f11867a8> (a org.apache.geode.distributed.internal.direct.DirectChannel)
>       at org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631)
>       at org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82)
>       at org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908)
>       at org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262)
>       at org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847)
>       at org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501)
>       at org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291)
> {noformat}
> That thread is waiting on a lock held by this thread (in ds3) which is 
> waiting on an acknowledgement to a PutAllPRMessage sent to ds1.
> {noformat}
> "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 tid=0x00007fdb9c030800 nid=0x30d1 runnable [0x00007fdb732f0000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378)
>         - locked <0x00000000f2643380> (a java.lang.Object)
>         at org.apache.geode.internal.net.NioSslEngine.readAtLeast(NioSslEngine.java:330)
>         at org.apache.geode.internal.tcp.MsgReader.readAtLeast(MsgReader.java:129)
>         at org.apache.geode.internal.tcp.MsgReader.readHeader(MsgReader.java:58)
> ==>     - locked <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine)
>         at org.apache.geode.internal.tcp.Connection.readAck(Connection.java:2652)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.readAcks(DirectChannel.java:392)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:342)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182)
>         at org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511)
>         at org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346)
>         at org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018)
>         at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083)
>         at org.apache.geode.internal.cache.partitioned.PutAllPRMessage.send(PutAllPRMessage.java:201)
>         at org.apache.geode.internal.cache.PartitionedRegion.tryToSendOnePutAllMessage(PartitionedRegion.java:2839)
>         at org.apache.geode.internal.cache.PartitionedRegion.sendMsgByBucket(PartitionedRegion.java:2621)
>         at org.apache.geode.internal.cache.PartitionedRegion.postPutAllSend(PartitionedRegion.java:2392)
>         at org.apache.geode.internal.cache.LocalRegionDataView.postPutAll(LocalRegionDataView.java:361)
>         at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9154)
>         at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:8903)
> {noformat}
> What we see is that the {{MsgReader}} in the second thread is not letting the 
> first thread close the socket. Until the socket is closed, the second thread 
> will be stuck in {{SocketChannel.read()}}.
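
The dependency described above (a reader holds a monitor while blocked in {{SocketChannel.read()}}, and the closer wants that same monitor before it closes the socket) can be reproduced outside Geode. The following is only a minimal sketch under stated assumptions, not the change made in the pull request: the class name, {{engineLock}} (a stand-in for the {{NioSslEngine}} monitor) and the loopback setup are all hypothetical. It relies on documented JDK behavior: closing a blocking {{SocketChannel}} from another thread makes a read in progress fail with {{AsynchronousCloseException}}, a subclass of {{ClosedChannelException}}.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

final class CloseUnblocksRead {
  // Hypothetical stand-in for the NioSslEngine monitor seen in the thread dumps.
  private static final Object engineLock = new Object();

  static String run() throws Exception {
    try (ServerSocketChannel server = ServerSocketChannel.open()) {
      server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
      SocketChannel client = SocketChannel.open(server.getLocalAddress());
      SocketChannel peer = server.accept(); // the peer never writes, so read() blocks

      CountDownLatch holdingLock = new CountDownLatch(1);
      AtomicReference<String> outcome = new AtomicReference<>("still blocked");

      Thread reader = new Thread(() -> {
        synchronized (engineLock) { // like the reader thread holding the engine monitor
          holdingLock.countDown();
          try {
            client.read(ByteBuffer.allocate(16));
            outcome.set("read returned");
          } catch (ClosedChannelException e) {
            // Usually AsynchronousCloseException, when the close arrives mid-read.
            outcome.set("unblocked by close");
          } catch (IOException e) {
            outcome.set(e.getClass().getSimpleName());
          }
        }
      });
      reader.setDaemon(true);
      reader.start();

      holdingLock.await();
      Thread.sleep(200); // give the reader time to block inside read()

      // The closer deliberately does NOT synchronize on engineLock here.
      // If it did, it would wait forever, as in the first thread dump.
      client.close();

      reader.join(5000);
      peer.close();
      return outcome.get();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run());
  }
}
```

If the {{client.close()}} call is wrapped in {{synchronized (engineLock)}}, the sketch hangs in the same shape as the two thread dumps: the closer blocks on the monitor, and the reader never leaves {{read()}} because nothing closes the socket.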



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
