[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805720#comment-17805720 ] Vipul Thakur edited comment on IGNITE-21059 at 1/25/24 7:08 AM:

I also ran *{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options use CONTROL_JVM_OPTS
JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection
INFO: Client TCP connection established: localhost/127.0.0.1:11211
[2024-01-11T22:40:23,579][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=x.x.x.x:41264, rmtAddr=/x.x.x.x:47100]
[2024-01-11T22:40:23,594][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/x.x.x.x:56674, rmtAddr=/x.x.x.x:47100]
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection close
INFO: Client TCP connection closed: localhost/127.0.0.1:11211
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils shutdownNow
WARNING: Runnable tasks outlived thread pool executor service [owner=GridClientConnectionManager, tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]
[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList [x.x.x.x, 127.0.0.1], sockAddrs=null, discPort=47500, order=90, intOrder=48, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList [x.x.x.x, 127.0.0.1], sockAddrs=null, discPort=47500, order=88, intOrder=47, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
[node=TcpDiscoveryNode [id=855b22e7-0ad7-4521-ab53-3af65b6fce73, consistentId=ee70a820-92a5-48c7-a5da-4965c946b550, addrs=ArrayList [x.x.x.x, 127.0.0.1], sockAddrs=null, discPort=47500, order=4, intOrder=4, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
Control utility [ver. 2.14.0#20220929-sha1:951e8deb]
2022 Copyright(C) Apache Software Foundation
Time: 2024-01-11T22:40:22.947
Command [CACHE] started
Arguments: --host localhost --port 11211 --user --password * --cache contention 5
Command [CACHE] finished with code: 0
Control utility has completed execution at: 2024-01-11T22:40:23.734
Execution time: 787 ms
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806067#comment-17806067 ] Vipul Thakur commented on IGNITE-21059: ---

As I said, all the nodes are in the same data center and we don't have any restrictions in terms of connectivity. Could it be a network fluctuation? Is there any way to benchmark the network with respect to Ignite? We also suspect this is the cause, but we don't have a way to demonstrate it to our network team.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running
> cache operations
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
> Issue Type: Bug
> Components: binary, clients
> Affects Versions: 2.14
> Reporter: Vipul Thakur
> Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, client-service.zip, digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, ignite-server-nohup-1.out, ignite-server-nohup.out, ignite_issue_1101.zip, image-2024-01-11-22-28-51-501.png, image.png, long_txn_.png, nohup_12.out
>
> We have recently upgraded from 2.7.6 to 2.14 due to an issue observed in our production environment where the cluster would go into a hang state due to partition map exchange.
> Please find below the ticket I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> The migration to Apache Ignite 2.14 went smoothly, but on the third day we saw the cluster traffic dip again.
> We have 5 nodes in a cluster, each with 400 GB of RAM and more than 1 TB of SSD.
> The configuration is attached for review, and I have also added the server logs from the time the issue happened.
> We have set the transaction timeout as well as the socket timeout, at both the server and client end, for our write operations, but sometimes the cluster goes into a hang state: all our get calls get stuck, everything slowly starts to freeze our JMS listener threads, and every thread reaches a choked-up state after some time.
> As a result, our read services, which do not even use transactions to retrieve data, also start to choke, ultimately leading to an end-user traffic dip.
> We were hoping the product upgrade would help, but that has not been the case so far.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
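The transaction and socket timeouts mentioned in the description are typically set in the node's Spring XML configuration. A minimal sketch is below; the concrete values are illustrative assumptions, not taken from the attached cache-config-1.xml:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Default timeout for transactions started on this node, in milliseconds. -->
    <property name="transactionConfiguration">
        <bean class="org.apache.ignite.configuration.TransactionConfiguration">
            <property name="defaultTxTimeout" value="5000"/>
        </bean>
    </property>
    <!-- Socket write timeout for node-to-node communication, in milliseconds. -->
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <property name="socketWriteTimeout" value="5000"/>
        </bean>
    </property>
</bean>
```

A transaction timeout bounds how long a stuck transaction can hold locks, but it does not by itself prevent the partition-map-exchange hangs described above.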
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805949#comment-17805949 ] Vipul Thakur edited comment on IGNITE-21059 at 1/12/24 9:08 AM:

All of them are in the same data center; there is no firewall, and all the required ports are open as well. It happened the 2nd time I stopped the node and restarted it.
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805949#comment-17805949 ] Vipul Thakur commented on IGNITE-21059: ---

All of them are in the same data center; there is no firewall, and all the required ports are open as well. It happened when I stopped the node and restarted it.
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805720#comment-17805720 ] Vipul Thakur commented on IGNITE-21059: ---

I also ran *{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options use CONTROL_JVM_OPTS
JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection
INFO: Client TCP connection established: localhost/127.0.0.1:11211
[2024-01-11T22:40:23,579][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/10.135.34.53:41264, rmtAddr=/10.135.34.68:47100]
[2024-01-11T22:40:23,594][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/10.135.34.53:56674, rmtAddr=/10.135.34.67:47100]
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection close
INFO: Client TCP connection closed: localhost/127.0.0.1:11211
Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils shutdownNow
WARNING: Runnable tasks outlived thread pool executor service [owner=GridClientConnectionManager, tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]
[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList [10.135.34.67, 127.0.0.1], sockAddrs=null, discPort=47500, order=90, intOrder=48, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList [10.135.34.53, 127.0.0.1], sockAddrs=null, discPort=47500, order=88, intOrder=47, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
[node=TcpDiscoveryNode [id=855b22e7-0ad7-4521-ab53-3af65b6fce73, consistentId=ee70a820-92a5-48c7-a5da-4965c946b550, addrs=ArrayList [10.135.34.68, 127.0.0.1], sockAddrs=null, discPort=47500, order=4, intOrder=4, lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]
Control utility [ver. 2.14.0#20220929-sha1:951e8deb]
2022 Copyright(C) Apache Software Foundation
Time: 2024-01-11T22:40:22.947
Command [CACHE] started
Arguments: --host localhost --port 11211 --user --password * --cache contention 5
Command [CACHE] finished with code: 0
Control utility has completed execution at: 2024-01-11T22:40:23.734
Execution time: 787 ms
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805709#comment-17805709 ] Vipul Thakur edited comment on IGNITE-21059 at 1/11/24 4:59 PM:

Hi [~zstan] | [~cos]

I ran a test again today in my local environment, changing concurrency to optimistic and the isolation level to serializable with a 5 s transaction timeout, and ran a long-running load with low traffic. We have multiple JMS listeners that communicate with Ignite while writing data. During the load I restarted one node to mimic a change in the cluster's network topology. The first time I did this nothing happened, but when I did it again with another node we observed the same issue as in production: the write services' listeners went into a choked state and my queue started piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the logs from all 3 nodes in that environment. We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 s, and rolled back failureDetectionTimeout and clientFailureDetectionTimeout to their default values. Please help us with your observations.

I have also modified my code to detect thread deadlock like below:

!image-2024-01-11-22-28-51-501.png|width=638,height=248!
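The thread-deadlock check mentioned in the comment above is shown only as a screenshot. A minimal sketch of such a check using the standard JDK ThreadMXBean is below; this is an assumption about the approach, not the reporter's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    /** Returns a description of deadlocked threads, or null if none are found. */
    public static String findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Finds threads deadlocked on monitors or ownable synchronizers (e.g. ReentrantLock).
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null)
            return null;
        StringBuilder sb = new StringBuilder("Deadlocked threads:\n");
        for (ThreadInfo info : mx.getThreadInfo(ids))
            sb.append("  ").append(info.getThreadName())
              .append(" waiting on ").append(info.getLockName()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = findDeadlocks();
        // In a fresh JVM with no contended locks this prints "no deadlock".
        System.out.println(report == null ? "no deadlock" : report);
    }
}
```

Note that a JVM-level deadlock check will not flag threads that are merely blocked waiting on Ignite cache operations; for those, thread dumps like the attached jstack files are the right tool.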
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805709#comment-17805709 ] Vipul Thakur commented on IGNITE-21059: ---

Hi [~zstan] | [~cos]

I ran a test again today in my local environment, changing concurrency to optimistic and the isolation level to serializable with a 5 s transaction timeout, and ran a long-running load with low traffic. We have multiple JMS listeners that communicate with Ignite while writing data. During the load I restarted one node to mimic a change in the cluster's network topology. The first time I did this nothing happened, but when I did it again with another node we observed the same issue as in production: the write services' listeners went into a choked state and my queue started piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the logs from all 3 nodes in that environment. We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 s, and rolled back failureDetectionTimeout and clientFailureDetectionTimeout to their default values. Please help us with your observations.
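The test setup described above (optimistic concurrency, serializable isolation, 5 s transaction timeout) corresponds to a TransactionConfiguration fragment like the following. This is a sketch that assumes the settings are applied node-wide as defaults rather than per transaction:

```xml
<property name="transactionConfiguration">
    <bean class="org.apache.ignite.configuration.TransactionConfiguration">
        <property name="defaultTxConcurrency" value="OPTIMISTIC"/>
        <property name="defaultTxIsolation" value="SERIALIZABLE"/>
        <property name="defaultTxTimeout" value="5000"/>
    </bean>
</property>
```

OPTIMISTIC/SERIALIZABLE transactions acquire no locks until commit and fail with an optimistic-conflict exception instead of blocking, which is why they are often tried when PESSIMISTIC transactions appear to hang.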
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: ignite_issue_1101.zip
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801646#comment-17801646 ] Vipul Thakur commented on IGNITE-21059: --- Hi [~zstan] Thank you for the observation. We have also observed a new exception related to striped pool : 2023-12-29 16:41:09.426 ERROR 1 --- [api.endpoint-22] b.b.EventProcessingErrorHandlerJmsSender : >>> Published error message ..EventProcessingErrorHandlerJmsSender .. *2023-12-29 16:41:09.569 WARN 1 --- [85b8d7f7-ntw27%] o.a.i.i.processors.pool.PoolProcessor : >>> Possible starvation in striped pool.* *Thread name: sys-stripe-0-#1%DIGITALAPI__PRIMARY_digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27%* Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_TX, topicOrd=20, ordered=false, timeout=0, skipOnTimeout=false, msg=TxLocksResponse [futId=2236, nearTxKeyLocks=HashMap {}, txKeys=null]]], Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridNearLockResponse [pending=ArrayList [], miniId=1, dhtVers=GridCacheVersion[] [GridCacheVersion [topVer=312674347, order=1703970204663, nodeOrder=2, dataCenterId=0]], mappedVers=GridCacheVersion[] [GridCacheVersion [topVer=315266949, order=1703839756326, nodeOrder=2, dataCenterId=0]], clientRemapVer=null, compatibleRemapVer=false, super=GridDistributedLockResponse [futId=b9a9f75bc81-870cf83b-d2dd-4aa0-9d9f-bffdb8d46b1a, err=null, vals=ArrayList [BinaryObjectImpl [arr= true, ctx=false, start=0]], super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=315266949, order=1703839751829, nodeOrder=11, dataCenterId=0], commit PFB for detailed logs [^digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log] Could be it due to having too many read client that our write services are getting affected. Should we be trying to decrease the no read services? > We have upgraded our ignite instance from 2.7.6 to 2.14. 
Found long running > cache operations > > > Key: IGNITE-21059 > URL: https://issues.apache.org/jira/browse/IGNITE-21059 > Project: Ignite > Issue Type: Bug > Components: binary, clients >Affects Versions: 2.14 >Reporter: Vipul Thakur >Priority: Critical > Attachments: Ignite_server_logs.zip, cache-config-1.xml, > client-service.zip, digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, > digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, > digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, > digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, > ignite-server-nohup-1.out, ignite-server-nohup.out, image.png, long_txn_.png, > nohup_12.out > > > We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in > production environment where cluster would go in hang state due to partition > map exchange. > Please find the below ticket which i created a while back for ignite 2.7.6 > https://issues.apache.org/jira/browse/IGNITE-13298 > So we migrated the apache ignite version to 2.14 and upgrade happened > smoothly but on the third day we could see cluster traffic dip again. > We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 > TB SDD. > PFB for the attached config.[I have added it as attachment for review] > I have also added the server logs from the same time when issue happened. > We have set txn timeout as well as socket timeout both at server and client > end for our write operations but seems like sometimes cluster goes into hang > state and all our get calls are stuck and slowly everything starts to freeze > our jms listener threads and every thread reaches a choked up state in > sometime. > Due to which our read services which does not even use txn to retrieve data > also starts to choke. Ultimately leading to end user traffic dip. 
> We have set a transaction timeout as well as socket timeouts, at both the server and client end, for our write operations, but it seems the cluster sometimes goes into a hang state: all our get calls are stuck, everything slowly starts to freeze our JMS listener threads, and every thread reaches a choked-up state after some time. Because of this, our read services, which do not even use transactions to retrieve data, also start to choke, ultimately leading to a dip in end-user traffic. We were hoping the product upgrade would help, but that has not been the case so far.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
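For context on the starvation warning quoted in this comment: Ignite's striped pool defaults to max(8, CPU cores) stripes, and its size can be overridden via IgniteConfiguration.stripedPoolSize. A minimal sketch of what that could look like in a Spring XML config in the style of the attached cache-config-1.xml (the bean layout and the value 64 are illustrative assumptions, not taken from the actual attachment):

```xml
<!-- Illustrative fragment only: not copied from cache-config-1.xml. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Default is max(8, number of CPU cores). Raising it helps only if the
         stripes are genuinely busy; if they are blocked waiting on TX locks
         (as the TxLocksResponse queue above suggests), a bigger pool just
         masks the underlying lock contention. -->
    <property name="stripedPoolSize" value="64"/>
</bean>
```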
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801194#comment-17801194 ] Vipul Thakur commented on IGNITE-21059: --- Hi [~zstan]

Today we got another issue in production:

[2023-12-29T03:13:47,467][INFO ][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager] *Starting to clean WAL archive [highIdx=8303528, currSize=512.0 MB, maxSize=1.0 GB]*
[2023-12-29T03:13:47,468][INFO ][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager] Finish clean WAL archive [cleanCnt=1, currSize=448.0 MB, maxSize=1.0 GB]
[2023-12-29T03:13:47,563][INFO ][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][FileWriteAheadLogManager] Copied file [src=/datastore2/wal/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/0005.wal, dst=/datastore2/archive/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/08303535.wal]
[2023-12-29T03:14:17,080][INFO ][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][Fil

In the above log it seems the WAL archive is also filling up fast. Should we also set maxWalArchiveSize to a higher value than the default 1 GB?

Find the logs from one of our nodes; this can be seen on all the nodes: [^nohup_12.out]

Please help us with your observations.
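The WAL archive cap asked about in the comment above is DataStorageConfiguration.maxWalArchiveSize, a byte value whose 1 GB default matches the "maxSize=1.0 GB" in the quoted log lines. A hedged sketch of raising it in Spring XML (the 4 GB figure is an arbitrary example, not a sizing recommendation for this cluster):

```xml
<!-- Illustrative fragment only; assumes the cluster uses Spring XML config. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- maxWalArchiveSize is in bytes; 4 GB here vs. the 1 GB default
                 visible as "maxSize=1.0 GB" in the WAL cleaner log above. -->
            <property name="maxWalArchiveSize" value="#{4L * 1024 * 1024 * 1024}"/>
        </bean>
    </property>
</bean>
```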
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: nohup_12.out
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800933#comment-17800933 ] Vipul Thakur edited comment on IGNITE-21059 at 12/28/23 7:42 AM: - !image.png! Yes, CPU(s) in each of the physical nodes is 160.

was (Author: vipul.thakur): !image.png! Yes, CPU(s) from one of the physical nodes is 160.
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800933#comment-17800933 ] Vipul Thakur commented on IGNITE-21059: --- !image.png! Yes, CPU(s) from one of the physical nodes is 160.
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: image.png
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: (was: image.png)
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: image.png
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800931#comment-17800931 ] Vipul Thakur commented on IGNITE-21059: --- I will get the exact value: as per the docs it is calculated as {{max(8, total number of cores)}}. I will ask my team to check it, and we will also monitor pool usage.

Still, I am not sure why the threads are stuck even after they have been timed out from the client end.
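The default sizing rule quoted in the comment above can be sketched as a tiny helper; `defaultPoolSize` is a hypothetical name for illustration, not an Ignite API:

```java
// Sketch of Ignite's documented default thread pool sizing rule:
// pool size = max(8, number of available CPU cores).
public class PoolSizeSketch {
    static int defaultPoolSize(int availableCores) {
        return Math.max(8, availableCores);
    }

    public static void main(String[] args) {
        // On the reporter's 160-core nodes this would yield 160.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("default pool size = " + defaultPoolSize(cores));
    }
}
```

So on a small container the floor of 8 applies, while on these 160-core hosts the pools are sized by core count, which is worth keeping in mind when monitoring pool usage.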
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800425#comment-17800425 ] Vipul Thakur edited comment on IGNITE-21059 at 12/26/23 7:17 AM: - Hi [~zstan]

*PFB for another such scenario from the log file. This is the same kind of log. [^ignite-server-nohup.out] Maybe if you search for it in this file you will get more context to observe.*

[2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, timeout=3, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55445052, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed to acquire lock within provided timeout for transaction [timeout=3, tx=GridDhtTxLocal[xid=5f4b66f1c81--12a3-06d7--0001, xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, 
state=MARKED_ROLLBACK, invalidate=false, rollbackOnly=true, nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, startTime=1701334276938, duration=30003]] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93) 
[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) [ignite-core-2.14.0.jar:2.14.0] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351] 2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%|#326%EVENT_PROCESSING%][GridDhtColocatedCache] Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0],
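The IgniteTxTimeoutCheckedException in the trace above fires when a pessimistic REPEATABLE_READ transaction cannot acquire a lock within its timeout (duration=30003 suggests roughly a 30 s limit here). For transactions that pin locks across topology changes, cluster-wide defaults can be set via TransactionConfiguration; a sketch in Spring XML, with illustrative values only:

```xml
<!-- Illustrative fragment only; values are examples, not recommendations. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="transactionConfiguration">
        <bean class="org.apache.ignite.configuration.TransactionConfiguration">
            <!-- Default timeout (ms) for transactions that do not set one explicitly. -->
            <property name="defaultTxTimeout" value="30000"/>
            <!-- Roll back transactions that would otherwise block partition map
                 exchange longer than this many ms; 0 (the default) disables it. -->
            <property name="txTimeoutOnPartitionMapExchange" value="20000"/>
        </bean>
    </property>
</bean>
```

Given that the original 2.7.6 ticket involved partition map exchange hangs, txTimeoutOnPartitionMapExchange may be the more relevant of the two knobs.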
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800425#comment-17800425 ] Vipul Thakur commented on IGNITE-21059: --- Hi [~zstan] *PFB for another such scenario from the logs file. This is the same kind of logs.[^ignite-server-nohup.out] , may be if you search for it in this file you will get more context for you to observe.* [2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, timeout=3, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55445052, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false] org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed to acquire lock within provided timeout for transaction [timeout=3, tx=GridDhtTxLocal[xid=5f4b66f1c81--12a3-06d7--0001, xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion [topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, invalidate=false, 
rollbackOnly=true, nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, startTime=1701334276938, duration=30003]] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86) ~[ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93) [ignite-core-2.14.0.jar:2.14.0] at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234) [ignite-core-2.14.0.jar:2.14.0] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) [ignite-core-2.14.0.jar:2.14.0] at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351]
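The entries above show pessimistic REPEATABLE_READ transactions hitting their lock-acquisition timeout and being marked for rollback. As a hedged sketch of where such timeouts live in the Spring XML config (the values are illustrative, not recommendations, and this would be merged into the existing IgniteConfiguration bean from the attached cache-config-1.xml):

```xml
<!-- Sketch only: illustrative values for Ignite's TransactionConfiguration. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="transactionConfiguration">
        <bean class="org.apache.ignite.configuration.TransactionConfiguration">
            <!-- Default timeout (ms) for transactions that do not set one explicitly. -->
            <property name="defaultTxTimeout" value="30000"/>
            <!-- Roll back transactions that block partition map exchange longer than
                 this threshold (ms); 0, the default, disables the rollback. -->
            <property name="txTimeoutOnPartitionMapExchange" value="20000"/>
        </bean>
    </property>
</bean>
```

`txTimeoutOnPartitionMapExchange` is the setting usually pointed at for the "cluster hangs on partition map exchange behind a long-running transaction" symptom described in this ticket: when PME starts, transactions holding it up past the threshold are rolled back instead of blocking the whole exchange.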
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: ignite-server-nohup-1.out
> We have upgraded our Ignite instance from 2.7.6 to 2.14. Found long-running cache operations.
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
> Issue Type: Bug
> Components: binary, clients
> Affects Versions: 2.14
> Reporter: Vipul Thakur
> Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, client-service.zip, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, ignite-server-nohup-1.out, ignite-server-nohup.out, long_txn_.png
>
> We recently upgraded from 2.7.6 to 2.14 due to an issue observed in our production environment where the cluster would hang due to partition map exchange.
> Please see the ticket I created a while back for Ignite 2.7.6: https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated Apache Ignite to 2.14 and the upgrade itself went smoothly, but on the third day we saw the cluster traffic dip again.
> We have 5 nodes in the cluster, each with 400 GB of RAM and more than 1 TB of SSD.
> The config is attached for review, along with server logs from the time the issue happened.
> We have set a transaction timeout as well as socket timeouts at both the server and client end for our write operations, but sometimes the cluster still goes into a hung state: all our get calls get stuck, everything slowly starts to freeze our JMS listener threads, and before long every thread is choked.
> Because of this, even our read services, which do not use transactions to retrieve data, start to choke, ultimately leading to an end-user traffic dip.
> We were hoping the product upgrade would help, but that has not been the case so far.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800410#comment-17800410 ] Vipul Thakur commented on IGNITE-21059: --- !long_txn_.png! Hi [~zstan] | [~cos], even after the client pods time out at 30 seconds, the server logs show transactions running much longer: the start time was around 14:06 and the log entry was printed at 14:16. Please help with your observations.
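To confirm which transactions are outliving the 30-second client timeout, the control script's transaction command can list them on a live cluster. A sketch, assuming the script is run from an Ignite node's bin/ directory; the host, port, and thresholds are illustrative:

```shell
# List transactions that have been running longer than 30 s, longest first.
./control.sh --host 127.0.0.1 --port 11211 --tx --min-duration 30 --order DURATION

# A transaction confirmed to be abandoned can then be rolled back by its xid:
# ./control.sh --tx --xid <xid> --kill
```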
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: long_txn_.png
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796799#comment-17796799 ] Vipul Thakur commented on IGNITE-21059: --- One of the JMS listeners was receiving more load than the rest. From the frequent logs about WAL data being moved to disk, my understanding is that this movement is causing the issue: while an entity is being written to disk, another write request arrives for the same entity, which is already busy being persisted.
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796738#comment-17796738 ] Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 2:04 PM: - We can't find the same in the server logs, but we will still look into it; as of now no bulk operation is implemented. [https://ignite.apache.org/docs/latest/key-value-api/transactions] Per the docs, the cause of the timeout should be TransactionDeadlockException, but we can't find this anywhere, either at the client or the server end. was (Author: vipul.thakur): in server logs can't find the same, still we will look into as of now no bulk operation is implemented.
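Per the linked transactions docs, a timeout caused by a detected deadlock surfaces on the caller as a CacheException with TransactionTimeoutException and TransactionDeadlockException in its cause chain; a timeout without the deadlock cause, which appears to match the server-side IgniteTxTimeoutCheckedException in the logs above, is a plain lock-wait timeout. A minimal client-side sketch of the documented check, assuming an ignite-core dependency and a started node; the cache name, key, and 300 ms timeout are illustrative:

```java
import javax.cache.CacheException;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.internal.util.typedef.X;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionDeadlockException;
import org.apache.ignite.transactions.TransactionTimeoutException;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

public class TxTimeoutCheck {
    /** Writes a value inside a pessimistic tx and classifies any timeout. */
    static void putWithTimeout(Ignite ignite, String key, String val) {
        IgniteCache<String, String> cache = ignite.cache("myCache"); // illustrative name

        // txStart(concurrency, isolation, timeoutMs, txSize); 300 ms is illustrative.
        try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 300, 0)) {
            cache.put(key, val);
            tx.commit();
        }
        catch (CacheException e) {
            if (X.hasCause(e, TransactionTimeoutException.class)
                && X.hasCause(e, TransactionDeadlockException.class))
                // The documented deadlock case: the timeout triggered deadlock detection.
                System.err.println("Deadlock detected: " + e.getMessage());
            else if (X.hasCause(e, TransactionTimeoutException.class))
                // Plain lock-wait timeout, no deadlock in the cause chain.
                System.err.println("Lock-wait timeout: " + e.getMessage());
            else
                throw e;
        }
    }
}
```

If neither client nor server logs ever show TransactionDeadlockException, the timeouts are more consistent with plain lock contention than with deadlocks.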
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796738#comment-17796738 ] Vipul Thakur commented on IGNITE-21059: --- We can't find the same in the server logs, but we will still look into it; as of now no bulk operation is implemented.
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724 ] Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:50 PM: -- Hi [~zstan] && [~cos], today we observed the same issue in our other data center, and restarting the apps helped. [This data center had been running for 44 days.] I am attaching logs from all nodes in the cluster -> {*}Ignite_server_logs.zip{*} [in this you can find logs from before the issue occurred]. I am also attaching the client service logs ---> *client-service.zip* *We are still in the process of implementing your recommendation.* Please help us with your observations. was (Author: vipul.thakur): Hi [~zstan] && [~cos] , today we observed the same issue in our other data center and restarting the apps helped.[this data center was running for 44 days] I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip[in this you can find logs before the issue came] I am also attaching client services logs ---> client-service.zip *We are still in process of implementing your recommendation.* Please help us with your observations.
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724 ] Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:50 PM: -- Hi [~zstan] && [~cos], today we observed the same issue in our other data center, and restarting the apps helped. [This data center had been running for 44 days.] I am attaching logs from all nodes in the cluster -> Ignite_server_logs.zip [in this you can find logs from before the issue occurred]. I am also attaching the client service logs ---> client-service.zip *We are still in the process of implementing your recommendation.* Please help us with your observations. was (Author: vipul.thakur): Hi [~zstan] , today we observed the same issue in our other data center and restarting the apps helped. I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip I am also attaching client services logs ---> client-service.zip *We are still in process of implementing your recommendation.*
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724 ] Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:48 PM: -- Hi [~zstan], today we observed the same issue in our other data center, and restarting the apps helped. I am attaching logs from all nodes in the cluster -> Ignite_server_logs.zip. I am also attaching the client service logs ---> client-service.zip *We are still in the process of implementing your recommendation.* was (Author: vipul.thakur): Hi [~zstan] , today we observed the same issue in our other data center and restarting the apps helped. I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: client-service.zip
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724 ] Vipul Thakur commented on IGNITE-21059: --- Hi [~zstan], today we observed the same issue in our other data center, and restarting the apps helped. I am attaching logs from all nodes in the cluster -> Ignite_server_logs.zip
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059: -- Attachment: Ignite_server_logs.zip
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795899#comment-17795899 ] Vipul Thakur commented on IGNITE-21059: --- Thank you for your response [~zstan]. I will make the above changes and let you know how it goes; I will also provide the logs from all nodes.
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795867#comment-17795867 ] Vipul Thakur commented on IGNITE-21059: --- So, as per my understanding, I will do the following; please correct me if I am wrong: failureDetectionTimeout and clientFailureDetectionTimeout will be switched back to their default values, which are 10 s and 30 s respectively; walSegmentSize will be increased from the default 64 MB to a bigger value, maybe around 512 MB (the limit being 2 GB). Any comments on the transaction timeout value, which is 30 s at the client? For TcpDiscoveryVmIpFinder, the socket timeout is 60 s at the server end and 5 s at the client end.
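For reference, the changes proposed in the comment above would look roughly like the following in a Spring XML Ignite configuration (a sketch only, assuming the same Spring-bean format as the attached cache-config-1.xml; the property names are from the Ignite 2.14 public API, and the 512 MB figure is the proposed value, not a recommendation — note that walSegmentSize is specified in bytes):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Revert to the defaults: 10 s for server nodes, 30 s for client nodes. -->
    <property name="failureDetectionTimeout" value="10000"/>
    <property name="clientFailureDetectionTimeout" value="30000"/>

    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- Proposed increase from the 64 MB default; 536870912 bytes = 512 MB. -->
            <property name="walSegmentSize" value="536870912"/>
        </bean>
    </property>
</bean>
```

The same settings can equally be applied programmatically via the corresponding IgniteConfiguration and DataStorageConfiguration setters.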
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795861#comment-17795861 ] Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 5:33 PM: - We have a daily requirement of 90-120 million read requests and around 15-20 million write requests. Current values: failureDetectionTimeout=12, clientFailureDetectionTimeout=12. What would be the suggested values? Should we bring these closer to the socketTimeout, which is around 5 s, and should these configurations be the same at both the server and client end?
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795859#comment-17795859 ] Vipul Thakur commented on IGNITE-21059: --- We have also configured a socket timeout at both the server and client end, but from the thread dumps it seems the transactions are all stuck at the get call.
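The socket and transaction timeouts discussed in this thread are typically configured as follows in Spring XML (a sketch under the assumption of the standard Ignite 2.14 bean properties; the 30 s and 5 s values mirror the numbers mentioned in this thread, not necessarily the attached cache-config-1.xml, and the discovery address is a placeholder):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Default timeout applied to transactions that do not set one explicitly. -->
    <property name="transactionConfiguration">
        <bean class="org.apache.ignite.configuration.TransactionConfiguration">
            <property name="defaultTxTimeout" value="30000"/> <!-- 30 s, as at the client -->
        </bean>
    </property>

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Discovery socket operation timeout; 5 s per the client-side value above. -->
            <property name="socketTimeout" value="5000"/>
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                    <property name="addresses">
                        <list>
                            <!-- Placeholder; replace with the actual server node addresses. -->
                            <value>127.0.0.1:47500..47509</value>
                        </list>
                    </property>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```

One caveat consistent with the symptom described here: the transaction timeout only bounds operations performed inside a transaction, so a plain cache get executed outside any transaction is not interrupted by it.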
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795858#comment-17795858 ] Vipul Thakur commented on IGNITE-21059: --- In 2.7.6 we used to observe the long-JVM-pause logger in the read services, and not that much in the writes; such behavior is not observed in 2.14. We have another such setup, with the same number of nodes in the cluster and the same number of clients, serving as another datacenter for our API endpoint. It has been running with no problems for over a month now, but when we upgraded our other datacenter, this issue occurred just 3 days after the upgrade.
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795851#comment-17795851 ] Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 4:59 PM: - We have two k8s clusters connected to that datacenter; in each k8s cluster, 10 clients are read, 10 are write, and 2 are admin-type services, for a total of 44 client nodes. I have also updated our cluster spec: it is 5 nodes, 400 GB RAM and 1 TB SSD. Long JVM pauses were observed in 2.7.6.
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059:
--
Description:
We recently upgraded from 2.7.6 to 2.14 because of an issue observed in our production environment where the cluster would hang due to partition map exchange. Please see the ticket I created a while back for Ignite 2.7.6: https://issues.apache.org/jira/browse/IGNITE-13298
We migrated to Apache Ignite 2.14 and the upgrade itself went smoothly, but on the third day we saw the cluster traffic dip again. We have 5 nodes in the cluster, each provisioned with 400 GB of RAM and more than 1 TB of SSD. The configuration is attached for review, along with the server logs from the time the issue occurred.
We have set a transaction timeout as well as a socket timeout, at both the server and the client end, for our write operations, but the cluster still sometimes goes into a hung state: all our get calls get stuck, our JMS listener threads gradually freeze, and after a while every thread is choked. As a result, our read services, which do not even use transactions to retrieve data, also start to choke, ultimately leading to a dip in end-user traffic. We were hoping the product upgrade would help, but that has not been the case so far.

was: (same description, except: 4 nodes in the cluster, and more than 1 TB of HDD)

> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
> Issue Type: Bug
> Components: binary, clients
> Affects Versions: 2.14
> Reporter: Vipul Thakur
> Priority: Critical
> Attachments: cache-config-1.xml,
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1,
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2,
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3,
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1,
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2,
> ignite-server-nohup.out

-- This message was sent by Atlassian Jira (v8.20.10#820010)
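For reference, the socket-level timeouts described in this report are typically configured on Ignite's communication SPI in the server's Spring XML. A minimal sketch follows; the 30-second values are placeholders for illustration, not the reporter's actual settings from the attached cache-config-1.xml:

```xml
<!-- Sketch only: timeout values are illustrative placeholders -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="communicationSpi">
        <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
            <!-- Fail a connection attempt that takes longer than 30 s -->
            <property name="connectTimeout" value="30000"/>
            <!-- Drop a connection if a socket write blocks longer than 30 s -->
            <property name="socketWriteTimeout" value="30000"/>
        </bean>
    </property>
</bean>
```

The same bean style is used in standard Ignite config.xml files, so a fragment like this can be merged into an existing IgniteConfiguration definition.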
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795838#comment-17795838 ] Vipul Thakur commented on IGNITE-21059:
---
Ok, please give me some time; we will change the WAL size and let you know.
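The WAL segment size change discussed above is made through DataStorageConfiguration in the server's Spring XML. A minimal sketch, assuming a 128 MB segment size purely for illustration (not a recommendation for this cluster):

```xml
<!-- Sketch only: the 128 MB value is illustrative -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- WAL segment size in bytes; the default is 64 MB (67108864).
                 134217728 = 128 * 1024 * 1024 -->
            <property name="walSegmentSize" value="#{134217728}"/>
        </bean>
    </property>
</bean>
```

Note that the segment size applies to newly created WAL segments; existing persistence directories keep their old segment layout until the WAL rolls over.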
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795714#comment-17795714 ] Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 2:42 PM:
-
Hi, thank you for the quick response. We have configured the transaction timeout at the client end; our clients are written in Spring Boot and Java. Is any configuration also needed in the server's config.xml? We will also read about changing the WAL segment size and make the changes accordingly.

was (Author: vipul.thakur): Hi Thank you for quick response, we have configured tx timeout at client end our clients are written in spring boot and java , is it needed at server's config.xml also ? We will also read about chaning-wal-segment-size and make the changes accordingly
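On the question above: a cluster-wide default can also be set in the server's config.xml through TransactionConfiguration. A minimal sketch with placeholder values; txTimeoutOnPartitionMapExchange is included because it is specifically aimed at transactions that block partition map exchange, the hang mode described in this ticket:

```xml
<!-- Sketch only: 30 s / 20 s values are placeholders -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="transactionConfiguration">
        <bean class="org.apache.ignite.configuration.TransactionConfiguration">
            <!-- Default timeout for transactions started without an explicit timeout -->
            <property name="defaultTxTimeout" value="30000"/>
            <!-- Roll back long-running transactions that are blocking partition map exchange -->
            <property name="txTimeoutOnPartitionMapExchange" value="20000"/>
        </bean>
    </property>
</bean>
```

A timeout passed explicitly to txStart on the client side still takes precedence over the server-side default for that transaction.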
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795735#comment-17795735 ] Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 1:08 PM:
-
Evidence that the transaction timeout is enabled at the client end. Below are the server logs:

[2023-11-30T14:19:01,783][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], threadId=372, futId=9c4a6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, timeout=3, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55444220, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false]

[2023-11-30T14:19:44,579][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache] Failed to acquire lock for request: GridNearLockRequest [topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest [nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion [topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], threadId=897, futId=a3ba6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, *timeout=3, isInTx=true, isInvalidate=false, isRead=true, isolation=REPEATABLE_READ,* retVals=[true], txSize=0, flags=0, keysCnt=1, super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], committedVers=null, rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, super=GridCacheMessage [msgId=55444392, depInfo=null, lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], err=null, skipPrepare=false]

org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed to acquire lock within provided timeout for transaction [timeout=3, tx=GridDhtTxLocal[xid=c8a166f1c81--12a3-06d7--0001, xidVersion=GridCacheVersion [topVer=312674007, order=1701333834380, nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion [topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, invalidate=false, rollbackOnly=true, nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, startTime=1701334154571, *duration=30003*]]

was (Author: vipul.thakur): (same server logs, without the introductory sentence)
[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795616#comment-17795616 ] Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 6:59 AM:
-
[~cos] Please help with the review.

was (Author: vipul.thakur): @cos Please help in review
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059:
--
Attachment: ignite-server-nohup.out
[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-21059:
--
Description: (updated to add the sentence: "I have also added the server logs from the same time when issue happened.")
[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795614#comment-17795614 ] Vipul Thakur commented on IGNITE-21059: --- Hi, please review and comment, and let me know if more info is needed.
[jira] [Created] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations
Vipul Thakur created IGNITE-21059: - Summary: We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations Key: IGNITE-21059 URL: https://issues.apache.org/jira/browse/IGNITE-21059 Project: Ignite Issue Type: Bug Components: binary, clients Affects Versions: 2.14 Reporter: Vipul Thakur Attachments: cache-config-1.xml, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2
[jira] [Commented] (IGNITE-6894) Hanged Tx monitoring
[ https://issues.apache.org/jira/browse/IGNITE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166370#comment-17166370 ] Vipul Thakur commented on IGNITE-6894: -- Is this resolved in any version? We are facing this issue.

> Hanged Tx monitoring
>
> Key: IGNITE-6894
> URL: https://issues.apache.org/jira/browse/IGNITE-6894
> Project: Ignite
> Issue Type: Improvement
> Reporter: Anton Vinogradov
> Assignee: Dmitriy Sorokin
> Priority: Major
> Labels: iep-7
>
> Hanging transactions not related to deadlock
> Description
> This situation can occur if the user explicitly marks up the transaction (especially Pessimistic Repeatable Read) and, for example, calls a remote service (which may be unresponsive) after acquiring some locks. All other transactions depending on the same keys will hang.
> Detection and solution
> This most likely cannot be resolved automatically other than by rolling back the TX on timeout and releasing all the locks acquired so far. Such TXs can also be rolled back from Web Console as described above.
> If a transaction has been rolled back on timeout or via the UI, then any further action in the transaction, e.g. lock acquisition or a commit attempt, should throw an exception.
> Report
> Management tools (e.g. Web Console) should provide the ability to roll back any transaction via the UI.
> Long-running transactions should be reported to the logs. The log record should contain: near nodes, transaction IDs, cache names, keys (limited to several tens of), etc ( ?).
> There should also be a screen in Web Console listing all ongoing transactions in the cluster, including the info above.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
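For the hanged-transaction monitoring and rollback described above, later Ignite 2.x releases expose transaction management through the control script (the same control.sh already used earlier in this thread). A sketch of the typical invocations, assuming the documented `--tx` flags; these require a running cluster, and exact flags should be verified against your version with `control.sh --tx --help`:

```shell
# List active transactions that have been running longer than 60 seconds,
# ordered by duration (flag names per the Ignite 2.x control script docs).
control.sh --tx --min-duration 60 --order DURATION

# Roll back the transactions matched by the same filter.
control.sh --tx --min-duration 60 --kill
```

This gives the manual "rollback any transaction" capability the ticket asks Web Console to provide, without waiting for the transaction timeout to fire.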
[jira] [Commented] (IGNITE-13298) Found long running cache at client end
[ https://issues.apache.org/jira/browse/IGNITE-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164776#comment-17164776 ] Vipul Thakur commented on IGNITE-13298: --- The cluster memory/persistence config is in the Environment section at the top.

> Found long running cache at client end
>
> Key: IGNITE-13298
> URL: https://issues.apache.org/jira/browse/IGNITE-13298
> Project: Ignite
> Issue Type: Task
> Affects Versions: 2.7.6
> Environment: cluster memory config/persistence
> [Garbled Spring XML fragment; the recoverable settings are: Log4J2Logger configured from ${IGNITE_SCRIPT}/ignite-log4j2.xml, checkpointPageBufferSize=${checkpointPageBufferSize}, storagePath=${storagePath}, walPath=${walPath}, walArchivePath=${walArchivePath}, walMode=LOG_ONLY, pageSize=${pageSize}, metricsEnabled=true]
>
> ==Client thread dump ===
> 2020-07-20 12:14:43 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):
>
> "Attach Listener" #788 daemon prio=9 os_prio=0 tid=0x7fe7f4001000 nid=0x32d waiting on condition [0x]
>    java.lang.Thread.State: RUNNABLE
>    Locked ownable synchronizers: - None
>
> "Context_6_jms_314_ConsumerDispatcher" #787 daemon prio=5 os_prio=0 tid=0x7fe6e805e000 nid=0x31a waiting on condition [0x7fe2e5bdd000]
>    java.lang.Thread.State: WAITING (parking)
>      at sun.misc.Unsafe.park(Native Method)
>      - parking to wait for <0xcb87d9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>      at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
>      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.eventLoop(ConsumerNotificationDispatcher.java:110)
>      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.run(ConsumerNotificationDispatcher.java:130)
>      at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "DefaultMessageListenerContainer-35" #786 prio=5 os_prio=0 tid=0x7fe460013800 nid=0x319 in Object.wait() [0x7fe2e5cde000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>      at java.lang.Object.wait(Native Method)
>      at com.solacesystems.jcsmp.impl.XMLMessageQueue.dequeue(XMLMessageQueue.java:130)
>      at com.solacesystems.jcsmp.impl.flow.FlowHandleImpl.receive(FlowHandleImpl.java:845)
>      - locked <0xcb8cce50> (a com.solacesystems.jcsmp.impl.XMLMessageQueueList)
>      at com.solacesystems.jms.SolMessageConsumer.receive(SolMessageConsumer.java:253)
>      at org.springframework.jms.connection.CachedMessageConsumer.receive(CachedMessageConsumer.java:86)
>      at org.springframework.jms.support.destination.JmsDestinationAccessor.receiveFromConsumer(JmsDestinationAccessor.java:132)
>      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveMessage(AbstractPollingMessageListenerContainer.java:418)
>      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:303)
>      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257)
>      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1189)
>      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1179)
>      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1076)
>      at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "Context_4_jms_313_ConsumerDispatcher" #785 daemon prio=5 os_prio=0 tid=0x7fe6f8028000 nid=0x318 waiting on condition [0x7fe2e5ddf000]
>    java.lang.Thread.State: WAITING (parking)
>      at sun.misc.Unsafe.park(Native Method)
>      - parking to wait for <0xcb8cf8d0> (a
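The Environment section above contains a garbled Spring XML fragment. A hedged reconstruction of what such a persistence section typically looks like in Ignite 2.x, keeping only the settings recoverable from the fragment (bean classes and property names follow the stock Ignite configuration API; the ${...} values are the reporter's own placeholders, and the `defaultDataRegionConfiguration` nesting is an assumption, since `checkpointPageBufferSize` lives on the data region bean):

```xml
<!-- Hedged reconstruction, not the reporter's verbatim config. -->
<property name="gridLogger">
    <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
        <constructor-arg type="java.lang.String" value="${IGNITE_SCRIPT}/ignite-log4j2.xml"/>
    </bean>
</property>
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="storagePath" value="${storagePath}"/>
        <property name="walPath" value="${walPath}"/>
        <property name="walArchivePath" value="${walArchivePath}"/>
        <!-- LOG_ONLY: WAL fsync on checkpoint, not on every commit. -->
        <property name="walMode" value="LOG_ONLY"/>
        <property name="pageSize" value="${pageSize}"/>
        <property name="metricsEnabled" value="true"/>
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
                <property name="checkpointPageBufferSize" value="${checkpointPageBufferSize}"/>
            </bean>
        </property>
    </bean>
</property>
```

An undersized checkpoint page buffer is a common cause of cluster-wide write throttling under persistence, so this fragment is worth reviewing alongside the thread dumps.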
[jira] [Updated] (IGNITE-13298) Found long running cache at client end
[ https://issues.apache.org/jira/browse/IGNITE-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vipul Thakur updated IGNITE-13298: -- Issue Type: Task (was: Bug) Priority: Blocker (was: Major)
[jira] [Created] (IGNITE-13298) Found long running cache at client end
Vipul Thakur created IGNITE-13298: - Summary: Found long running cache at client end Key: IGNITE-13298 URL: https://issues.apache.org/jira/browse/IGNITE-13298 Project: Ignite Issue Type: Bug Affects Versions: 2.7.6 Environment: cluster memory config/persistence