[ https://issues.apache.org/jira/browse/IGNITE-11288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pavel Voronkin updated IGNITE-11288:
------------------------------------
    Description:
Root cause is a Java bug: SSLSocketImpl.close() blocks on the write lock.

//we create socket with soTimeout(0) here, but setting it here won't help anyway.
RingMessageWorker: 3152 sock = spi.openSocket(addr, timeoutHelper);

//After timeout grid-timeout-worker blocks forever: SSLSocketImpl.close() in onTimeout hangs on writeLock.

According to java8 SSLSocketImpl:
{code:java}
if (var1.isAlert((byte)0) && this.getSoLinger() >= 0) {
    boolean var3 = Thread.interrupted();
    try {
        if (this.writeLock.tryLock((long)this.getSoLinger(), TimeUnit.SECONDS)) {
            try {
                this.writeRecordInternal(var1, var2);
            } finally {
                this.writeLock.unlock();
            }
        } else {
            SSLException var4 = new SSLException("SO_LINGER timeout, close_notify message cannot be sent.");
            if (this.isLayered() && !this.autoClose) {
                this.fatal((byte)-1, (Throwable)var4);
            } else if (debug != null && Debug.isOn("ssl")) {
                System.out.println(Thread.currentThread().getName() + ", received Exception: " + var4);
            }
            this.sess.invalidate();
        }
    } catch (InterruptedException var14) {
        var3 = true;
    }
    if (var3) {
        Thread.currentThread().interrupt();
    }
} else {
    this.writeLock.lock();
    try {
        this.writeRecordInternal(var1, var2);
    } finally {
        this.writeLock.unlock();
    }
}
{code}
If soLinger is not set, we fall back to this.writeLock.lock(), which waits forever because RingMessageWorker is writing a message with SO_TIMEOUT zero.

Solution:
1) Set proper SO_TIMEOUT //that didn't help on Linux in case we drop packets using iptables.
2) Set SO_LINGER to some reasonable positive value.

Similar JDK bug [https://bugs.openjdk.java.net/browse/JDK-6668261]. Guys end up setting SO_LINGER>

  was:
Rootcause is java bug locking on SSLSocketImpl.close() on write lock:

//we create socket with soTimeout(0) here, but setting it here won't help anyway.
RingMessageWorker: 3152 sock = spi.openSocket(addr, timeoutHelper);

//After timeout grid-timeout-worker blocks forever but SSLSOcketImpl.close() onTimeout hangs on writeLock.

According to java8 SSLSocketImpl:
{code:java}
if (var1.isAlert((byte)0) && this.getSoLinger() >= 0) {
    boolean var3 = Thread.interrupted();
    try {
        if (this.writeLock.tryLock((long)this.getSoLinger(), TimeUnit.SECONDS)) {
            try {
                this.writeRecordInternal(var1, var2);
            } finally {
                this.writeLock.unlock();
            }
        } else {
            SSLException var4 = new SSLException("SO_LINGER timeout, close_notify message cannot be sent.");
            if (this.isLayered() && !this.autoClose) {
                this.fatal((byte)-1, (Throwable)var4);
            } else if (debug != null && Debug.isOn("ssl")) {
                System.out.println(Thread.currentThread().getName() + ", received Exception: " + var4);
            }
            this.sess.invalidate();
        }
    } catch (InterruptedException var14) {
        var3 = true;
    }
    if (var3) {
        Thread.currentThread().interrupt();
    }
} else {
    this.writeLock.lock();
    try {
        this.writeRecordInternal(var1, var2);
    } finally {
        this.writeLock.unlock();
    }
}
{code}
In case of soLinger is not set we fallback to this.writeLock.lock(); which wait forever, cause RingMessageWorker is writing message with SO_TIMEOUT zero.

Solution:
1) Set proper SO_TIMEOUT //we checked that didn' help on Linux if drop packets using iptables .
2) Set SO_LINGER to some reasonable positive value.

Similar JDK bug [https://bugs.openjdk.java.net/browse/JDK-6668261].

> TcpDiscovery deadlock on SSLSocket.close().
> -------------------------------------------
>
>                 Key: IGNITE-11288
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11288
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Pavel Voronkin
>            Assignee: Pavel Voronkin
>            Priority: Critical
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Root cause is a Java bug: SSLSocketImpl.close() blocks on the write lock.
> //we create socket with soTimeout(0) here, but setting it here won't help
> anyway.
> RingMessageWorker: 3152 sock = spi.openSocket(addr, timeoutHelper);
> //After timeout grid-timeout-worker blocks forever: SSLSocketImpl.close()
> in onTimeout hangs on writeLock.
> According to java8 SSLSocketImpl:
> {code:java}
> if (var1.isAlert((byte)0) && this.getSoLinger() >= 0) {
>     boolean var3 = Thread.interrupted();
>     try {
>         if (this.writeLock.tryLock((long)this.getSoLinger(), TimeUnit.SECONDS)) {
>             try {
>                 this.writeRecordInternal(var1, var2);
>             } finally {
>                 this.writeLock.unlock();
>             }
>         } else {
>             SSLException var4 = new SSLException("SO_LINGER timeout, close_notify message cannot be sent.");
>             if (this.isLayered() && !this.autoClose) {
>                 this.fatal((byte)-1, (Throwable)var4);
>             } else if (debug != null && Debug.isOn("ssl")) {
>                 System.out.println(Thread.currentThread().getName() + ", received Exception: " + var4);
>             }
>             this.sess.invalidate();
>         }
>     } catch (InterruptedException var14) {
>         var3 = true;
>     }
>     if (var3) {
>         Thread.currentThread().interrupt();
>     }
> } else {
>     this.writeLock.lock();
>     try {
>         this.writeRecordInternal(var1, var2);
>     } finally {
>         this.writeLock.unlock();
>     }
> }
> {code}
> If soLinger is not set, we fall back to this.writeLock.lock(), which waits
> forever because RingMessageWorker is writing a message with SO_TIMEOUT zero.
> Solution:
> 1) Set proper SO_TIMEOUT //that didn't help on Linux in case we drop packets
> using iptables.
> 2) Set SO_LINGER to some reasonable positive value.
> Similar JDK bug [https://bugs.openjdk.java.net/browse/JDK-6668261].
> Guys end up setting SO_LINGER>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
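
The second proposed fix (setting SO_LINGER to a positive value) could be sketched roughly as below. This is only an illustration of the branch quoted from SSLSocketImpl, not Ignite or JDK code: the class SoLingerSketch and the helpers applyLinger and acquireLikeClose are hypothetical names, a plain java.net.Socket stands in for the SSL socket, and a ReentrantLock held by a background thread stands in for the writer stuck in RingMessageWorker.

```java
import java.io.IOException;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class SoLingerSketch {
    /** Hypothetical helper: enables SO_LINGER with a timeout given in seconds. */
    public static void applyLinger(Socket sock, int lingerSeconds) throws IOException {
        sock.setSoLinger(true, lingerSeconds);
    }

    /**
     * Simplified stand-in for the branch in the quoted close() code:
     * with soLinger >= 0 the wait for the lock is bounded, otherwise
     * the caller blocks in lock() until the holder releases it.
     * Returns true if the lock was acquired, false on timeout.
     */
    public static boolean acquireLikeClose(ReentrantLock writeLock, int soLinger)
            throws InterruptedException {
        if (soLinger >= 0) {
            return writeLock.tryLock(soLinger, TimeUnit.SECONDS);
        }
        writeLock.lock(); // unbounded wait, the case described in this issue
        return true;
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock writeLock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Background thread simulating a writer that never finishes its write.
        Thread writer = new Thread(() -> {
            writeLock.lock();
            try {
                held.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                writeLock.unlock();
            }
        });
        writer.setDaemon(true);
        writer.start();
        held.await();

        try (Socket sock = new Socket()) {
            // getSoLinger() returns -1 while the option is disabled.
            System.out.println("SO_LINGER before: " + sock.getSoLinger());
            applyLinger(sock, 1);
            System.out.println("SO_LINGER after: " + sock.getSoLinger());

            // With linger set, the wait gives up instead of hanging forever.
            boolean acquired = acquireLikeClose(writeLock, sock.getSoLinger());
            System.out.println("lock acquired: " + acquired);
            if (acquired) {
                writeLock.unlock();
            }
        } finally {
            release.countDown();
            writer.join();
        }
    }
}
```

With SO_LINGER left at its default, the same call would take the lock() branch and wait until the simulated writer releases the lock, which is the hang described above. The one-second value is arbitrary; what counts as "reasonable" for discovery sockets is not specified in this issue.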