Hello!

I recommend setting socketWriteTimeout somewhat lower than the failure detection timeout, but longer than any of your expected GC pauses. 30 s is OK.
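
To make that rule of thumb concrete, here is a minimal sketch in plain Java. `TimeoutCheck` and `isInRelativeSync` are hypothetical names, not part of Apache Ignite, and reading "somewhat lower" as "strictly below the failure detection timeout" is my assumption. The numbers come from this thread: writeTimeout=2000 and the 10133 ms pause from the log, plus the 480000 failure detection timeout.

```java
// Hypothetical helper, not part of Apache Ignite: encodes the rule of thumb
// discussed in this thread so candidate values can be sanity-checked.
final class TimeoutCheck {
    /**
     * Returns true when the socket write timeout is longer than the longest
     * expected GC pause and lower than the failure detection timeout.
     * All arguments are in milliseconds.
     */
    static boolean isInRelativeSync(long socketWriteTimeoutMs,
                                    long failureDetectionTimeoutMs,
                                    long maxExpectedGcPauseMs) {
        return socketWriteTimeoutMs > maxExpectedGcPauseMs
            && socketWriteTimeoutMs < failureDetectionTimeoutMs;
    }

    public static void main(String[] args) {
        long failureDetectionTimeoutMs = 480_000L; // value reported in this thread
        long observedPauseMs = 10_133L;            // "Possible too long JVM pause" in the log

        // writeTimeout=2000 from the log is shorter than the observed pause.
        System.out.println(isInRelativeSync(2_000L, failureDetectionTimeoutMs, observedPauseMs));

        // 30 s sits between the observed pause and the failure detection timeout.
        System.out.println(isInRelativeSync(30_000L, failureDetectionTimeoutMs, observedPauseMs));
    }
}
```

The value you settle on would then be applied through the socketWriteTimeout property of TcpCommunicationSpi (`setSocketWriteTimeout` in Java configuration).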
Regards,
--
Ilya Kasnacheev

Sun, 12 Jul 2020 at 14:03, Kamlesh Joshi <kamlesh.jo...@ril.com>:

> Thanks for the findings, Ilya.
>
> So shall we set the same timeout value for *socketWriteTimeout* as that
> of the failure detection timeout on both the client and server side?
>
> *Thanks and Regards,*
> *Kamlesh Joshi*
>
> *From:* Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> *Sent:* 10 July 2020 19:48
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
> The e-mail below is from an external source. Please do not open
> attachments or click links from an unknown or suspicious origin.
>
> Hello!
>
> It seems that communication connections were closed after the GC pause, and
> you were then left with half-open connections. It is recommended to keep
> socketWriteTimeout and failure detection timeout in relative sync.
>
> The default socketWriteTimeout on TcpCommunicationSpi is very low while your
> failure detection timeout is rather high, leading to this issue.
>
> It is also possible that client nodes can connect to a server node but not
> vice versa, so connections cannot be reopened once they are closed:
>
> Thread [name="sys-stripe-12-#13%EDIFCustomerCC%", id=45, state=RUNNABLE, blockCnt=851, waitCnt=27526057]
>     at sun.nio.ch.Net.poll(Native Method)
>     at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
>     at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
>     at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
>     at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
>     at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
>     at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
>     at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
>
> Regards,
> --
> Ilya Kasnacheev
>
> Fri, 10 Jul 2020 at 16:32, Kamlesh Joshi <kamlesh.jo...@ril.com>:
>
> Hi Ilya,
>
> Please find attached the entire node logs, which contain a thread dump as well.
> Let us know of any findings.
>
> *Thanks and Regards,*
> *Kamlesh Joshi*
>
> *From:* Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> *Sent:* 10 July 2020 17:51
> *To:* user@ignite.apache.org
> *Subject:* Re: [External]Re: Ignite cluster became unresponsive
>
> Hello!
>
> Can you provide a full thread dump (jstack) after you see these messages?
>
> Regards,
> --
> Ilya Kasnacheev
>
> Wed, 8 Jul 2020 at 15:57, Kamlesh Joshi <kamlesh.jo...@ril.com>:
>
> Hi Stephen/Team,
>
> Did you get a chance to look into this?
>
> *Thanks and Regards,*
> *Kamlesh Joshi*
>
> *From:* Kamlesh Joshi
> *Sent:* 06 July 2020 14:50
> *To:* user@ignite.apache.org
> *Subject:* RE: [External]Re: Ignite cluster became unresponsive
>
> Hi Stephen,
>
> We have started our node with the JVM parameters below. Also, we have
> increased the timeouts *failureDetectionTimeout*,
> *clientFailureDetectionTimeout* and *networkTimeout* to *480000* ms.
>
> *-XX:+AggressiveOpts -XX:+AlwaysPreTouch -XX:+UseG1GC
> -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
> -XX:+UnlockCommercialFeatures -Djava.net.preferIPv4Stack=true
> -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=600000
> -DIGNITE_THREAD_DUMP_ON_EXCHANGE_TIMEOUT=true -Dfile.encoding=UTF-8
> -DIGNITE_QUIET=false*
>
> Is there anything else that we have to tune?
>
> And I think the JVM pause was introduced as a result of the error that we
> encountered, right? Correct me if I am wrong.
>
> *Thanks and Regards,*
> *Kamlesh Joshi*
>
> *From:* Stephen Darlington <stephen.darling...@gridgain.com>
> *Sent:* 06 July 2020 14:09
> *To:* user <user@ignite.apache.org>
> *Subject:* [External]Re: Ignite cluster became unresponsive
>
> There are a few issues here (the blocked thread, the communication error),
> but possibly the key one is the JVM pause:
>
> *[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.*
>
> This is usually due to garbage collection, but there are a number of other
> possibilities such as slow I/O.
> I suggest you start with the recommendations
> on the GC tuning documentation page:
> https://apacheignite.readme.io/docs/jvm-and-system-tuning
>
> Regards,
> Stephen
>
> On 4 Jul 2020, at 12:44, Kamlesh Joshi <kamlesh.jo...@ril.com> wrote:
>
> Hi Team,
>
> We have encountered the following defect in our PROD environment, after which
> all traffic halted for around 10 minutes. We recently upgraded our
> cluster from Ignite 2.6.0 to 2.7.6.
>
> Is this related to any existing open defect in this version? Has anyone
> observed the same defect earlier?
>
> Any help or pointers around this will be appreciated.
>
> *[2020-07-03T18:17:11,613][ERROR][sys-stripe-36-#37%CustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour*
> *[threadName=partition-exchanger, blockedFor=480s]*
> *[2020-07-03T18:17:11,613][WARN ][sys-stripe-36-#37%CustomerCC%][G] Thread [name="exchange-worker-#344%CustomerCC%", id=391, state=TIMED_WAITING, blockCnt=1, waitCnt=2049782]*
> *    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6bf9f3a4, ownerName=null, ownerId=-1]*
>
> *[2020-07-03T18:17:11,620][ERROR][sys-stripe-36-#37%CustomerCC%][] Critical system error detected.
> Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]]]*
> *org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=CustomerCC, finished=false, heartbeatTs=1593780431612]*
> *    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]*
> *    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]*
> *    at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]*
> *    at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]*
> *    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]*
> *    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]*
> *    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]*
> *[2020-07-03T18:17:11,625][WARN ][sys-stripe-36-#37%CustomerCC%][FailureProcessor] No deadlocked threads detected.*
> *[2020-07-03T18:17:21,790][INFO ][tcp-disco-sock-reader-#201%CustomerCC%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:46416, rmtPort=46416*
> *[2020-07-03T18:17:21,793][WARN ][jvm-pause-detector-worker][IgniteKernal%CustomerCC] Possible too long JVM pause: 10133 milliseconds.*
> *[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-31-#295%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11764, writeTimeout=2000]*
> *[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-57-#321%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:38500, writeTimeout=2000]*
> *[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-5-#269%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:41442, writeTimeout=2000]*
> *[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:44178, writeTimeout=2000]*
> *[2020-07-03T18:17:21,794][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:11884, writeTimeout=2000]*
> *[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:39044, writeTimeout=2000]*
> *[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-53-#317%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:48756, writeTimeout=2000]*
> *[2020-07-03T18:17:21,795][WARN ][grid-nio-worker-tcp-comm-59-#323%CustomerCC%][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/xx.xx.xx.xx:42190, writeTimeout=2000]*
>
> *Thanks and Regards,*
> *Kamlesh Joshi*
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."