[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-24 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805720#comment-17805720
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 1/25/24 7:08 AM:


I also ran 

*{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options 
use CONTROL_JVM_OPTS

JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 


INFO: Client TCP connection established: localhost/127.0.0.1:11211

[2024-01-11T22:40:23,579][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=x.x.x.x:41264, rmtAddr=/x.x.x.x:47100]

[2024-01-11T22:40:23,594][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/x.x.x.x:56674, rmtAddr=/x.x.x.x:47100]

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 
close

INFO: Client TCP connection closed: localhost/127.0.0.1:11211

Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils 
shutdownNow

WARNING: Runnable tasks outlived thread pool executor service 
[owner=GridClientConnectionManager, 
tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]

[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList [x.x.x.x 
127.0.0.1], sockAddrs=null, discPort=47500, order=90, intOrder=48, 
lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]]

[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, 
consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList [x.x.x.x 
127.0.0.1], sockAddrs=null, discPort=47500, order=88, intOrder=47, 
lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]]

[node=TcpDiscoveryNode [id=855b22e7-0ad7-4521-ab53-3af65b6fce73, 
consistentId=ee70a820-92a5-48c7-a5da-4965c946b550, addrs=ArrayList [x.x.x.x, 
127.0.0.1], sockAddrs=null, discPort=47500, order=4, intOrder=4, 
lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]]

Control utility [ver. 2.14.0#20220929-sha1:951e8deb]

2022 Copyright(C) Apache Software Foundation

Time: 2024-01-11T22:40:22.947

Command [CACHE] started

Arguments: --host localhost --port 11211 --user  --password * --cache 
contention 5



Command [CACHE] finished with code: 0

Control utility has completed execution at: 2024-01-11T22:40:23.734

Execution time: 787 ms


was (Author: vipul.thakur):
I also ran 

*{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options 
use CONTROL_JVM_OPTS

JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 


INFO: Client TCP connection established: localhost/127.0.0.1:11211

2024-01-11T22:40:23,579][INFO 
][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:41264, 
rmtAddr=/10.135.34.68:47100]

2024-01-11T22:40:23,594][INFO 
][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:56674, 
rmtAddr=/10.135.34.67:47100]

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 
close

INFO: Client TCP connection closed: localhost/127.0.0.1:11211

Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils 
shutdownNow

WARNING: Runnable tasks outlived thread pool executor service 
[owner=GridClientConnectionManager, 
tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]

[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList 
[10.135.34.67, 127.0.0.1], sockAddrs=null, discPort=47500, order=90, 
intOrder=48, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, 
consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList 
[10.135.34.53, 127.0.0.1], sockAddrs=null, discPort=47500, order=88, 
intOrder=47, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode 

[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806067#comment-17806067
 ] 

Vipul Thakur commented on IGNITE-21059:
---

As I said, all the nodes are in the same data center and we don't have any restrictions in terms of connectivity. Could it be a network fluctuation? Is there any way to benchmark this with respect to Ignite? We also suspect the network is the cause, but we don't have a way to demonstrate that to our network team.
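
One rough, application-level way to get a latency number to show the network team (a minimal sketch, not an official Ignite benchmark; the client setup here is an assumption) is to time a no-op compute broadcast across the server nodes:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClusterPingProbe {
    public static void main(String[] args) {
        // Connect as a thick client node; discovery settings are assumed to come
        // from the default configuration or an XML passed on the command line.
        try (Ignite ignite = Ignition.start(new IgniteConfiguration().setClientMode(true))) {
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();

                // A no-op closure broadcast to every server node: the measured time is
                // dominated by communication round trips, so large or erratic values
                // point at the network rather than at cache operations themselves.
                ignite.compute().broadcast(() -> {});

                System.out.printf("broadcast round trip #%d: %.2f ms%n",
                    i, (System.nanoTime() - start) / 1_000_000.0);
            }
        }
    }
}
{code}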

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup-1.out, ignite-server-nohup.out, ignite_issue_1101.zip, 
> image-2024-01-11-22-28-51-501.png, image.png, long_txn_.png, nohup_12.out
>
>
> We recently upgraded from 2.7.6 to 2.14 because of an issue observed in the
> production environment where the cluster would hang during partition map
> exchange.
> Please see the ticket I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> We migrated Apache Ignite to 2.14 and the upgrade went smoothly, but on the
> third day we saw the cluster traffic dip again.
> We have 5 nodes in a cluster, each provided with 400 GB of RAM and more than
> 1 TB of SSD.
> Please find the attached config (added as an attachment for review).
> I have also added the server logs from the time the issue happened.
> We have set a transaction timeout as well as a socket timeout, at both the
> server and the client end, for our write operations, but it seems that
> sometimes the cluster still goes into a hang state: all our get calls get
> stuck, everything slowly starts to freeze our JMS listener threads, and every
> thread reaches a choked-up state after some time.
> Because of this, our read services, which do not even use transactions to
> retrieve data, also start to choke, ultimately leading to an end-user traffic
> dip.
> We were hoping the product upgrade would help, but that has not been the case
> so far.





[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805949#comment-17805949
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 1/12/24 9:08 AM:


All of the nodes are in the same data center, there is no firewall, and all the required ports are open. It happened the second time I stopped a node and restarted it.

 


was (Author: vipul.thakur):
All of the nodes are in the same data center, there is no firewall, and all the required ports are open. It happened when I stopped a node and restarted it.

 



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805949#comment-17805949
 ] 

Vipul Thakur commented on IGNITE-21059:
---

All of the nodes are in the same data center, there is no firewall, and all the required ports are open. It happened when I stopped a node and restarted it.

 



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805720#comment-17805720
 ] 

Vipul Thakur commented on IGNITE-21059:
---

I also ran 

*{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options 
use CONTROL_JVM_OPTS

JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 


INFO: Client TCP connection established: localhost/127.0.0.1:11211

2024-01-11T22:40:23,579][INFO 
][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:41264, 
rmtAddr=/10.135.34.68:47100]

2024-01-11T22:40:23,594][INFO 
][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:56674, 
rmtAddr=/10.135.34.67:47100]

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 
close

INFO: Client TCP connection closed: localhost/127.0.0.1:11211

Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils 
shutdownNow

WARNING: Runnable tasks outlived thread pool executor service 
[owner=GridClientConnectionManager, 
tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]

[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList 
[10.135.34.67, 127.0.0.1], sockAddrs=null, discPort=47500, order=90, 
intOrder=48, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, 
consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList 
[10.135.34.53, 127.0.0.1], sockAddrs=null, discPort=47500, order=88, 
intOrder=47, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode [id=855b22e7-0ad7-4521-ab53-3af65b6fce73, 
consistentId=ee70a820-92a5-48c7-a5da-4965c946b550, addrs=ArrayList 
[10.135.34.68, 127.0.0.1], sockAddrs=null, discPort=47500, order=4, intOrder=4, 
lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]]

Control utility [ver. 2.14.0#20220929-sha1:951e8deb]

2022 Copyright(C) Apache Software Foundation

Time: 2024-01-11T22:40:22.947

Command [CACHE] started

Arguments: --host localhost --port 11211 --user  --password * --cache 
contention 5 



Command [CACHE] finished with code: 0

Control utility has completed execution at: 2024-01-11T22:40:23.734

Execution time: 787 ms


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805709#comment-17805709
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 1/11/24 4:59 PM:


Hi [~zstan]  | [~cos] 

 

I ran a test again today in my local environment. I changed the transaction concurrency mode to optimistic and the isolation level to serializable, with a 5 s transaction timeout, and ran a long-running load with low traffic only (we have multiple JMS listeners that communicate with Ignite while writing data). During the load I restarted one node to mimic a change in the network topology of the cluster. The first time I did this nothing happened, but when I did it again with another node we observed the same issue we see in production.

The write services' listeners went into a choked state, and my queue started piling up.
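
For reference, the transaction settings described in this comment (optimistic concurrency, serializable isolation, 5-second timeout) correspond roughly to the following sketch; the cache name, key/value types and retry handling are illustrative assumptions, not the actual application code:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionOptimisticException;

public class OptimisticWriteExample {
    public static void write(Ignite ignite, String key, String value) {
        IgniteCache<String, String> cache = ignite.cache("eventCache"); // assumed cache name

        // OPTIMISTIC + SERIALIZABLE with a 5 second timeout (0 = unknown tx size).
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.OPTIMISTIC,
                TransactionIsolation.SERIALIZABLE,
                5_000,
                0)) {
            cache.put(key, value);
            tx.commit();
        }
        catch (TransactionOptimisticException e) {
            // Serializable optimistic transactions fail fast on write conflicts;
            // the caller is expected to retry.
            System.err.println("Optimistic conflict, retry needed: " + e.getMessage());
        }
    }
}
{code}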

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the logs from all 3 nodes in that environment.

 

We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 seconds, and rolled back the failure detection timeout and the client failure detection timeout to their default values.
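
The configuration changes listed above could be expressed programmatically along the following lines. This is only a sketch: the cluster actually uses the attached XML, and whether "the WAL size" refers to the WAL segment size is an assumption here:

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class NodeConfigSketch {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // WAL size increased to 512 MB (assumed here to mean the WAL segment size).
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setWalSegmentSize(512 * 1024 * 1024);
        cfg.setDataStorageConfiguration(storage);

        // Default transaction timeout reduced to 5 seconds.
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(5_000);
        cfg.setTransactionConfiguration(txCfg);

        // Failure detection timeouts rolled back to their defaults,
        // so they are simply not overridden here.
        return cfg;
    }
}
{code}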

Please help us with your observations.

 

I have also modified my code to detect thread deadlocks, as shown below:

 

!image-2024-01-11-22-28-51-501.png|width=638,height=248!
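
The referenced screenshot shows the author's change; a client-side deadlock check of this general kind, using only the standard JDK ThreadMXBean API, might look roughly like this sketch:

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatcher {
    /** Logs Java-level deadlocks, if any, among the application's own threads. */
    public static void checkForDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = mx.findDeadlockedThreads();

        if (deadlocked != null) {
            for (ThreadInfo info : mx.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                System.err.println("Deadlocked thread: " + info.getThreadName()
                    + " waiting on " + info.getLockName());
            }
        }
    }
}
{code}

Note that such a check reports only true Java monitor/lock deadlocks; threads that are merely blocked waiting on Ignite futures or network replies will not appear in it.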


was (Author: vipul.thakur):
Hi [~zstan]  | [~cos] 

 

I ran a test again today in my local environment. I changed the transaction concurrency mode to optimistic and the isolation level to serializable, with a 5 s transaction timeout, and ran a long-running load with low traffic only (we have multiple JMS listeners that communicate with Ignite while writing data). During the load I restarted one node to mimic a change in the network topology of the cluster. The first time I did this nothing happened, but when I did it again with another node we observed the same issue we see in production.

The write services' listeners went into a choked state, and my queue started piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the logs from all 3 nodes in that environment.

We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 seconds, and rolled back the failure detection timeout and the client failure detection timeout to their default values.

Please help us with your observations.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805709#comment-17805709
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan]  | [~cos] 

 

I ran a test again today in my local environment. I changed the transaction concurrency mode to optimistic and the isolation level to serializable, with a 5 s transaction timeout, and ran a long-running load with low traffic only (we have multiple JMS listeners that communicate with Ignite while writing data). During the load I restarted one node to mimic a change in the network topology of the cluster. The first time I did this nothing happened, but when I did it again with another node we observed the same issue we see in production.

The write services' listeners went into a choked state, and my queue started piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the logs from all 3 nodes in that environment.

We have increased the WAL size to 512 MB, reduced the transaction timeout to 5 seconds, and rolled back the failure detection timeout and the client failure detection timeout to their default values.

Please help us with your observations.



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: ignite_issue_1101.zip



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-01 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801646#comment-17801646
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan] 

Thank you for the observation.

 

We have also observed a new exception related to the striped pool:

 

2023-12-29 16:41:09.426 ERROR 1 --- [api.endpoint-22] 
b.b.EventProcessingErrorHandlerJmsSender : >>> Published error 
message ..EventProcessingErrorHandlerJmsSender ..
*2023-12-29 16:41:09.569  WARN 1 --- [85b8d7f7-ntw27%] 
o.a.i.i.processors.pool.PoolProcessor    : >>> Possible starvation in striped 
pool.*
    *Thread name: 
sys-stripe-0-#1%DIGITALAPI__PRIMARY_digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27%*
    Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_TX, 
topicOrd=20, ordered=false, timeout=0, skipOnTimeout=false, msg=TxLocksResponse 
[futId=2236, nearTxKeyLocks=HashMap {}, txKeys=null]]], Message closure 
[msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, 
timeout=0, skipOnTimeout=false, msg=GridNearLockResponse [pending=ArrayList [], 
miniId=1, dhtVers=GridCacheVersion[] [GridCacheVersion [topVer=312674347, 
order=1703970204663, nodeOrder=2, dataCenterId=0]], 
mappedVers=GridCacheVersion[] [GridCacheVersion [topVer=315266949, 
order=1703839756326, nodeOrder=2, dataCenterId=0]], clientRemapVer=null, 
compatibleRemapVer=false, super=GridDistributedLockResponse 
[futId=b9a9f75bc81-870cf83b-d2dd-4aa0-9d9f-bffdb8d46b1a, err=null, 
vals=ArrayList [BinaryObjectImpl [arr= true, ctx=false, start=0]], 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=315266949, 
order=1703839751829, nodeOrder=11, dataCenterId=0], commit

Please find below the detailed logs:

[^digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log]

 

Could it be that, because we have too many read clients, our write services are getting affected?

Should we try to decrease the number of read services?
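
For context, the striped pool that the warning refers to is configured per server node. A minimal sketch of sizing it explicitly is shown below; the value used is only an example, not a recommendation for this cluster:

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;

public class StripedPoolSizing {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // The default stripe count is derived from the number of CPU cores.
        // Raising it only helps if stripes are busy with many short operations;
        // it does not help if individual cache operations are stuck on locks.
        cfg.setStripedPoolSize(Runtime.getRuntime().availableProcessors() * 2);

        return cfg;
    }
}
{code}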



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-01 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-29 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801194#comment-17801194
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan] 

 

Today we got another issue in production:


2023-12-29T03:13:47,467][INFO 
][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager]
 *Starting to clean WAL archive [highIdx=8303528, currSize=512.0 MB, 
maxSize=1.0 GB]*
2023-12-29T03:13:47,468][INFO 
][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager]
 Finish clean WAL archive [cleanCnt=1, currSize=448.0 MB, maxSize=1.0 GB]
2023-12-29T03:13:47,563][INFO 
][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][FileWriteAheadLogManager]
 Copied file 
[src=/datastore2/wal/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/0005.wal,
 
dst=/datastore2/archive/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/08303535.wal]
2023-12-29T03:14:17,080][INFO 
][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][Fil

 

In the above log it seems the WAL archive is also filling up fast.

Should we also set maxWalArchiveSize to a higher value than the default 1 GB?

The attached logs are from one of our nodes; the same can be seen on all the nodes:

[^nohup_12.out]

Please help us with your observations.
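
If the archive limit is raised, the corresponding setting is {{maxWalArchiveSize}} on the data storage configuration. A minimal sketch (the 4 GB value is only an example) might look like:

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalArchiveSizing {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        DataStorageConfiguration storage = new DataStorageConfiguration();

        // Cap the WAL archive at 4 GB instead of the 1 GB seen in the logs above.
        storage.setMaxWalArchiveSize(4L * 1024 * 1024 * 1024);

        cfg.setDataStorageConfiguration(storage);
        return cfg;
    }
}
{code}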



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-29 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: nohup_12.out



[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800933#comment-17800933
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/28/23 7:42 AM:
-

!image.png!

 

Yes, the CPU count on each of the physical nodes is 160.


was (Author: vipul.thakur):
!image.png!

 

Yes, the CPU count on one of the physical nodes is 160.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800933#comment-17800933
 ] 

Vipul Thakur commented on IGNITE-21059:
---

!image.png!

 

Yes, the CPU count on one of the physical nodes is 160.



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: image.png



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: (was: image.png)



[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: image.png



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800931#comment-17800931
 ] 

Vipul Thakur commented on IGNITE-21059:
---

I will get the exact value: as per the docs it is calculated as {{max(8, total number of cores)}}. I will ask my team to check it, and we will also monitor the pool usage.

Still, I am not sure why the threads stay stuck even after they have timed out on the client end.
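
For reference, the {{max(8, total number of cores)}} default mentioned above can be checked, and a pool size overridden, roughly as follows; which exact pool was being discussed is not stated, so the system pool is used here purely as an example:

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;

public class PoolSizeCheck {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // The default mentioned in the comment above: max(8, total number of cores).
        int defaultPoolSize = Math.max(8, cores);
        System.out.println("cores=" + cores + ", default pool size=" + defaultPoolSize);

        // Overriding the system pool size explicitly (value shown only as an example).
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setSystemThreadPoolSize(defaultPoolSize);
    }
}
{code}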



[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800425#comment-17800425
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/26/23 7:17 AM:
-

Hi 

[~zstan] 

*PFB another such scenario from the log file. These are the same kind of logs ([^ignite-server-nohup.out]); if you search for this entry in that file it should give you more context.*

[2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55445052, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=5f4b66f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334276938, duration=30003]]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) 
[ignite-core-2.14.0.jar:2.14.0]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351]
2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], 

[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800425#comment-17800425
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi 

[~zstan] 

*PFB another such scenario from the log file. These are the same kind of logs ([^ignite-server-nohup.out]); if you search for this entry in that file it should give you more context.*

[2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55445052, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=5f4b66f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334276938, duration=30003]]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) 
[ignite-core-2.14.0.jar:2.14.0]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351]
2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, 

[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: ignite-server-nohup-1.out

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup-1.out, ignite-server-nohup.out, long_txn_.png
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800410#comment-17800410
 ] 

Vipul Thakur commented on IGNITE-21059:
---

* !long_txn_.png!

Hi 

 

[~zstan]  | [~cos]

Even after the client pods time out after 30 secs, we can observe in the server logs that transactions keep running for much longer: the start time was around 14:06 and the log entry was printed at 14:16.

Please help with your observations.
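For what it's worth, one way to confirm this from inside a node would be to dump local transactions that outlive the 30-second client timeout; the sketch below is only illustrative and assumes access to the Ignite instance on a server node:

{code:java}
import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.transactions.Transaction;

public class LongTxnDump {
    /** Prints local transactions that have been running longer than maxMillis. */
    public static void dumpLongRunning(Ignite ignite, long maxMillis) {
        Collection<Transaction> active = ignite.transactions().localActiveTransactions();
        long now = System.currentTimeMillis();

        for (Transaction tx : active) {
            long durationMs = now - tx.startTime();
            if (durationMs > maxMillis)
                System.out.println("xid=" + tx.xid()
                    + ", state=" + tx.state()
                    + ", timeout=" + tx.timeout()
                    + ", durationMs=" + durationMs);
        }
    }
}
{code}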

 

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out, long_txn_.png
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: long_txn_.png

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out, long_txn_.png
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796799#comment-17796799
 ] 

Vipul Thakur commented on IGNITE-21059:
---

One of the JMS listeners was receiving more load than the rest of the listeners. What I can understand from the frequent logs about the WAL being moved to disk is that this is causing the issue: while the data is being moved, another write request arrives for the same entity, which is already busy being written to disk.
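If the theory is that WAL/checkpoint activity stalls concurrent writes, the relevant knobs live on DataStorageConfiguration; a rough sketch is below (the values are placeholders taken from the defaults, not tuning advice for this cluster):

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalTuningSketch {
    public static IgniteConfiguration withWalSettings() {
        DataStorageConfiguration storage = new DataStorageConfiguration();

        // How WAL records are synced to disk; LOG_ONLY is the default mode.
        storage.setWalMode(WALMode.LOG_ONLY);

        // How often dirty pages are checkpointed to disk (180 s is the default).
        storage.setCheckpointFrequency(180_000);

        // Throttle writers gradually instead of freezing them when a checkpoint falls behind.
        storage.setWriteThrottlingEnabled(true);

        return new IgniteConfiguration().setDataStorageConfiguration(storage);
    }
}
{code}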

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796738#comment-17796738
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 2:04 PM:
-

I can't find the same in the server logs; we will still look into it. As of now no bulk operation is implemented.

 

[https://ignite.apache.org/docs/latest/key-value-api/transactions]

 

As per the docs, the cause of the timeout should be a TransactionDeadlockException, but I cannot find it anywhere, either at the client or the server end.
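For context, the pattern described on that page looks roughly like the sketch below (the cache name and the 30-second timeout are assumptions matching the client settings described earlier, not values from the docs):

{code:java}
import javax.cache.CacheException;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionDeadlockException;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionTimeoutException;

public class TxTimeoutExample {
    public static void update(Ignite ignite, String key, String value) {
        IgniteCache<String, String> cache = ignite.cache("someCache"); // hypothetical cache name

        // Pessimistic REPEATABLE_READ transaction with a 30 s timeout and unknown size (0).
        try (Transaction tx = ignite.transactions().txStart(
            TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ, 30_000, 0)) {
            cache.put(key, value);
            tx.commit();
        }
        catch (CacheException e) {
            // The timeout carries a TransactionDeadlockException cause only when a deadlock was detected.
            if (e.getCause() instanceof TransactionTimeoutException
                && e.getCause().getCause() instanceof TransactionDeadlockException)
                System.err.println("Deadlock detected: " + e.getCause().getCause().getMessage());
            else
                throw e;
        }
    }
}
{code}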


was (Author: vipul.thakur):
in server logs can't find the same, still we will look into as of now no bulk 
operation is implemented.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796738#comment-17796738
 ] 

Vipul Thakur commented on IGNITE-21059:
---

I can't find the same in the server logs; we will still look into it. As of now no bulk operation is implemented.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:50 PM:
--

Hi [~zstan] && [~cos], today we observed the same issue in our other data center and restarting the apps helped. [This data center had been running for 44 days.]

I am attaching the logs from all nodes of the cluster -> {*}Ignite_server_logs.zip{*} [in this you can find the logs from before the issue occurred]

I am also attaching the client services' logs ---> *client-service.zip*

*We are still in the process of implementing your recommendation.*

Please help us with your observations.


was (Author: vipul.thakur):
Hi [~zstan] && [~cos]  , today we observed the same issue in our other data 
center and restarting the apps helped.[this data center was running for 44 days]

I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip[in 
this you can find logs before the issue came]

I am also attaching client services logs ---> client-service.zip

*We are still in process of implementing your recommendation.*

Please help us with your observations.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:50 PM:
--

Hi [~zstan] && [~cos]  , today we observed the same issue in our other data 
center and restarting the apps helped.[this data center was running for 44 days]

I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip[in 
this you can find logs before the issue came]

I am also attaching client services logs ---> client-service.zip

*We are still in process of implementing your recommendation.*

Please help us with your observations.


was (Author: vipul.thakur):
Hi [~zstan] , today we observed the same issue in our other data center and 
restarting the apps helped.

I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip

I am also attaching client services logs ---> client-service.zip

*We are still in process of implementing your recommendation.*

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/14/23 12:48 PM:
--

Hi [~zstan] , today we observed the same issue in our other data center and 
restarting the apps helped.

I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip

I am also attaching client services logs ---> client-service.zip

*We are still in process of implementing your recommendation.*


was (Author: vipul.thakur):
Hi [~zstan] , today we observed the same issue in our other data center and 
restarting the apps helped.

I am attaching all nodes logs from the cluster -> Ignite_server_logs.zip

 

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: client-service.zip

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan], today we observed the same issue in our other data center and restarting the apps helped.

I am attaching the logs from all nodes of the cluster -> Ignite_server_logs.zip

 

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: Ignite_server_logs.zip

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795899#comment-17795899
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Thank you for your response, [~zstan].

We will make the above changes and let you know how it goes; we will also provide the logs from all nodes.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795867#comment-17795867
 ] 

Vipul Thakur commented on IGNITE-21059:
---

So as per my understanding I will be doing the following; please correct me if I am wrong:

failureDetectionTimeout and clientFailureDetectionTimeout will switch back to their default values, which are 10 secs and 30 secs respectively.

We will increase the walSegmentSize from the default 64 MB to a bigger value, maybe around 512 MB [the limit being 2 GB].

Any comments regarding the txn timeout value, which is 30 secs at the client?

For TcpDiscoveryVmIpFinder, the socket timeout is 60 secs at the server end and 5 secs at the client end.
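To make the planned changes concrete, here is a sketch of where those knobs live on the Java configuration side (the same properties can be set in the Spring XML config); the values simply mirror the numbers above and are not recommendations:

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class PlannedServerConfig {
    public static IgniteConfiguration build() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Back to the defaults mentioned above: 10 s failure detection, 30 s for clients.
        cfg.setFailureDetectionTimeout(10_000);
        cfg.setClientFailureDetectionTimeout(30_000);

        // Larger WAL segments: 512 MB instead of the default 64 MB (2 GB is the upper limit).
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setWalSegmentSize(512 * 1024 * 1024);
        cfg.setDataStorageConfiguration(storage);

        // Discovery socket timeout on the server side (60 s as described above).
        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setSocketTimeout(60_000);
        cfg.setDiscoverySpi(discovery);

        return cfg;
    }
}
{code}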

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795861#comment-17795861
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 5:33 PM:
-

We have a daily requirement of 90-120 million read requests and around 15-20 million write requests.

Current values:

failureDetectionTimeout=12

clientFailureDetectionTimeout= 12

What would be the suggested values? Should we bring these closer to the socketTimeout, which is around 5 secs, and should these configurations be the same at both the server and the client end?


was (Author: vipul.thakur):
We have daily requirement of 90-120 millions request for read and around 15-20 
millions write requests

current values : 

failureDetectionTimeout=12

clientFailureDetectionTimeout= 12

What would be the suggested value should bring this closer to what 
socketTimeout is like 5secs and should these configuration be same at both 
server and client end?

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795861#comment-17795861
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 5:32 PM:
-

We have daily requirement of 90-120 millions request for read and around 15-20 
millions write requests

current values : 

failureDetectionTimeout=12

clientFailureDetectionTimeout= 12

What would be the suggested value should bring this closer to what 
socketTimeout is like 5secs and should these configuration be same at both 
server and client end?


was (Author: vipul.thakur):
We have daily requirement of 90-120 millions request for read and around 15-20 
millions 

current values : 

failureDetectionTimeout=12

clientFailureDetectionTimeout= 12

What would be the suggested value should bring this closer to what 
socketTimeout is like 5secs and should these configuration be same at both 
server and client end?

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795861#comment-17795861
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have a daily requirement of 90-120 million read requests and around 15-20 
million write requests.

Current values:

failureDetectionTimeout=12

clientFailureDetectionTimeout=12

What would be the suggested values? Should we bring these closer to what the 
socketTimeout is (around 5 seconds), and should these configurations be the same 
at both the server and client end?

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795859#comment-17795859
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have also configured the socket timeout at both the server and client end, but 
from the thread dump it seems like it is stuck at the get call in all the txns.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795859#comment-17795859
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 5:12 PM:
-

We have also configured the socket timeout at both the server and client end, but 
from the thread dump it seems like it is stuck at the get call in all the txns.
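
For reference, a minimal sketch of where the socket-level timeouts we mean are usually set; the 5000 ms values are only illustrative, not our production settings:

{code:java}
// Illustrative sketch only: socket-level timeouts on the communication and
// discovery SPIs. The 5000 ms values are examples, not production settings.
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class SocketTimeoutsSketch {
    public static IgniteConfiguration configure() {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setConnectTimeout(5_000L);      // connection establishment timeout, ms
        commSpi.setSocketWriteTimeout(5_000L);  // blocked socket writes fail after this, ms

        TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
        discoSpi.setSocketTimeout(5_000L);      // discovery socket operations timeout, ms

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        cfg.setDiscoverySpi(discoSpi);
        return cfg;
    }
}
{code}

As far as we understand, once these SPI-level timeouts are set explicitly they take precedence over anything derived from failureDetectionTimeout.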


was (Author: vipul.thakur):
We also have configured socket timeout at server and client end but from thread 
dump is seems like its stuck at get call in all the txns.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795858#comment-17795858
 ] 

Vipul Thakur commented on IGNITE-21059:
---

In 2.7.6 we used to observe the long JVM pause logger in the read services and not 
so much in the write services.

Such behavior is not observed in 2.14. We have another such setup with the same 
number of nodes in the cluster and the same number of clients, serving as another 
datacenter for our API endpoint; it has been running with no problems for over a 
month now. But when we upgraded our other data center, this issue occurred after 
just 3 days.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795851#comment-17795851
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 4:59 PM:
-

We have two k8s clusters connected to that datacenter; in each k8s cluster 10 
clients are read, 10 are write and 2 are kind of admin services, so 44 client 
nodes in total. I have also updated our cluster spec: it is 5 nodes, 400 GB RAM 
and 1 TB SSD.

Long JVM pauses were observed in 2.7.6.


was (Author: vipul.thakur):
we have two k8s cluster connected to that datacenter where in each k8s cluster 
10 are read , 10 are write and 2 are kind of admin service. So in total of 44 
client nodes. And i have also updated our cluster spec its 5 nodes , 400GB RAM 
and 1 Tb SDD.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795851#comment-17795851
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have two k8s clusters connected to that datacenter; in each k8s cluster 10 
clients are read, 10 are write and 2 are kind of admin services, so 44 client 
nodes in total. I have also updated our cluster spec: it is 5 nodes, 400 GB RAM 
and 1 TB SSD.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795851#comment-17795851
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 4:59 PM:
-

We have two k8s clusters connected to that datacenter; in each k8s cluster 10 
clients are read, 10 are write and 2 are kind of admin services, so 44 client 
nodes in total. I have also updated our cluster spec: it is 5 nodes, 400 GB RAM 
and 1 TB SSD.

Long JVM pauses were observed in 2.7.6.


was (Author: vipul.thakur):
we have two k8s cluster connected to that datacenter where in each k8s cluster 
10 are read , 10 are write and 2 are kind of admin service. So in total of 44 
client nodes. And i have also updated our cluster spec its 5 nodes , 400GB RAM 
and 1 Tb SDD.

 

Long JVM pauses were observed in in 2.7.6.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Description: 
We have recently upgraded from 2.7.6 to 2.14 due to an issue observed in our 
production environment where the cluster would go into a hang state due to 
partition map exchange.

Please find below the ticket which I created a while back for Ignite 2.7.6:

https://issues.apache.org/jira/browse/IGNITE-13298

So we migrated the Apache Ignite version to 2.14 and the upgrade went smoothly, 
but on the third day we could see the cluster traffic dip again.

We have 5 nodes in a cluster, where we provide 400 GB of RAM and more than 1 TB 
of SSD.

Please refer to the attached config [I have added it as an attachment for review].

I have also added the server logs from the time when the issue happened.

We have set the txn timeout as well as the socket timeout, both at the server and 
client end, for our write operations, but it seems like sometimes the cluster goes 
into a hang state: all our get calls are stuck, everything slowly starts to freeze 
our JMS listener threads, and every thread reaches a choked-up state after some time.

Due to this, our read services, which do not even use txns to retrieve data, 
also start to choke, ultimately leading to an end-user traffic dip.

We were hoping the product upgrade would help, but that has not been the case so 
far.

 

 

 

 

 

 

  was:
We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
production environment where cluster would go in hang state due to partition 
map exchange.

Please find the below ticket which i created a while back for ignite 2.7.6

https://issues.apache.org/jira/browse/IGNITE-13298

So we migrated the apache ignite version to 2.14 and upgrade happened smoothly 
but on the third day we could see cluster traffic dip again. 

We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 TB 
HDD.

PFB for the attached config.[I have added it as attachment for review]

I have also added the server logs from the same time when issue happened.

We have set txn timeout as well as socket timeout both at server and client end 
for our write operations but seems like sometimes cluster goes into hang state 
and all our get calls are stuck and slowly everything starts to freeze our jms 
listener threads and every thread reaches a choked up state in sometime.

Due to which our read services which does not even use txn to retrieve data 
also starts to choke. Ultimately leading to end user traffic dip.

We were hoping product upgrade will help but that has not been the case till 
now. 

 

 

 

 

 

 


> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 5 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB SDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795838#comment-17795838
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Ok, please give me some time; we will change the WAL size and let you know.
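
For context, a minimal sketch of the kind of change we understand is being suggested, assuming it refers to DataStorageConfiguration.walSegmentSize; the 256 MB figure is only an example, not the value we agreed on:

{code:java}
// Illustrative sketch only: enlarging the WAL segment size via
// DataStorageConfiguration. 256 MB below is an example figure, not a recommendation.
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalSegmentSizeSketch {
    public static IgniteConfiguration configure() {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // WAL segment size is given in bytes.
        storageCfg.setWalSegmentSize(256 * 1024 * 1024);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);
        return cfg;
    }
}
{code}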

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795714#comment-17795714
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 2:42 PM:
-

Hi,

Thank you for the quick response. We have configured the tx timeout at the client 
end (our clients are written in Spring Boot and Java); is any config needed in the 
server's config.xml as well?

We will also read about changing the WAL segment size and make the changes 
accordingly.
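
For anyone else following, a minimal sketch of the server-side piece we are asking about; the same TransactionConfiguration bean can be declared in the server's Spring config.xml, and the millisecond values below are placeholders:

{code:java}
// Illustrative sketch only: server-side transaction defaults. The same two
// properties can be declared as a TransactionConfiguration bean in config.xml.
// Millisecond values below are placeholders.
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;

public class ServerTxDefaultsSketch {
    public static IgniteConfiguration configure() {
        TransactionConfiguration txCfg = new TransactionConfiguration();

        // Default timeout applied when a transaction is started without an explicit one.
        txCfg.setDefaultTxTimeout(30_000L);

        // Caps how long running transactions may delay partition map exchange.
        txCfg.setTxTimeoutOnPartitionMapExchange(20_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setTransactionConfiguration(txCfg);
        return cfg;
    }
}
{code}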


was (Author: vipul.thakur):
Hi 

Thank you for quick response, we have configured tx timeout at client end our 
clients are written in spring boot and java , is it needed at server's 
config.xml also ? 

We will also read about chaning-wal-segment-size and make the changes 
accordingly 

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795735#comment-17795735
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 1:08 PM:
-

Evidence that the txn timeout is enabled at the client end:

Below are the server logs:

[2023-11-30T14:19:01,783][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], 
threadId=372, futId=9c4a6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641101, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444220, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
[2023-11-30T14:19:44,579][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
threadId=897, futId=a3ba6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
*timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ,* retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641190, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444392, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=c8a166f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333834380, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334154571, *duration=30003]*]
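
For completeness, a minimal client-side sketch of the pattern that produces a timeout like the one above instead of an indefinitely blocked get; the cache name, key type and the 30-second timeout are made up for illustration and assume a TRANSACTIONAL cache:

{code:java}
// Illustrative sketch only: a pessimistic REPEATABLE_READ transaction started
// with an explicit timeout, so a contended get() fails instead of hanging.
// Cache name, key type and the 30 s timeout are made up for this example and
// assume the cache is configured with TRANSACTIONAL atomicity mode.
import javax.cache.CacheException;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class ClientTxTimeoutSketch {
    static Object readWithTimeout(Ignite ignite, String key) {
        IgniteCache<String, Object> cache = ignite.cache("exampleCache");

        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                30_000L,   // tx timeout in ms
                0)) {      // expected number of entries, 0 = unknown
            Object val = cache.get(key); // acquires the key lock under PESSIMISTIC mode
            tx.commit();
            return val;
        } catch (CacheException e) {
            // A lock wait exceeding the timeout typically surfaces here, with a
            // transaction-timeout exception in the cause chain (matching the
            // IgniteTxTimeoutCheckedException seen in the server logs above).
            throw e;
        }
    }
}
{code}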


was (Author: vipul.thakur):
Evidence that txn timeout is enabled at client end : 

 

2023-11-30T14:19:01,783][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], 
threadId=372, futId=9c4a6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641101, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444220, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
[2023-11-30T14:19:44,579][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, 

[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795735#comment-17795735
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Evidence that txn timeout is enabled at client end : 

 

2023-11-30T14:19:01,783][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], 
threadId=372, futId=9c4a6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641101, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444220, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
[2023-11-30T14:19:44,579][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
threadId=897, futId=a3ba6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
*timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ,* retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641190, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444392, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=c8a166f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333834380, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334154571, *duration=30003]*]

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything 

[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795714#comment-17795714
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi,

Thank you for the quick response. We have configured the tx timeout at the client 
end (our clients are written in Spring Boot and Java); is it needed in the server's 
config.xml as well?

We will also read about changing the WAL segment size and make the changes 
accordingly.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795616#comment-17795616
 ] 

Vipul Thakur edited comment on IGNITE-21059 at 12/12/23 6:59 AM:
-

[~cos] Please help with the review.


was (Author: vipul.thakur):
@cos Please help in review

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795616#comment-17795616
 ] 

Vipul Thakur commented on IGNITE-21059:
---

@cos Please help in review

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Attachment: ignite-server-nohup.out

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-21059:
--
Description: 
We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
production environment where cluster would go in hang state due to partition 
map exchange.

Please find the below ticket which i created a while back for ignite 2.7.6

https://issues.apache.org/jira/browse/IGNITE-13298

So we migrated the apache ignite version to 2.14 and upgrade happened smoothly 
but on the third day we could see cluster traffic dip again. 

We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 TB 
HDD.

PFB for the attached config.[I have added it as attachment for review]

I have also added the server logs from the same time when issue happened.

We have set txn timeout as well as socket timeout both at server and client end 
for our write operations but seems like sometimes cluster goes into hang state 
and all our get calls are stuck and slowly everything starts to freeze our jms 
listener threads and every thread reaches a choked up state in sometime.

Due to which our read services which does not even use txn to retrieve data 
also starts to choke. Ultimately leading to end user traffic dip.

We were hoping product upgrade will help but that has not been the case till 
now. 

 

 

 

 

 

 

  was:
We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
production environment where cluster would go in hang state due to partition 
map exchange.

Please find the below ticket which i created a while back for ignite 2.7.6

https://issues.apache.org/jira/browse/IGNITE-13298

So we migrated the apache ignite version to 2.14 and upgrade happened smoothly 
but on the third day we could see cluster traffic dip again. 

We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 TB 
HDD.

PFB for the attached config.[I have added it as attachment for review]

We have set txn timeout as well as socket timeout both at server and client end 
for our write operations but seems like sometimes cluster goes into hang state 
and all our get calls are stuck and slowly everything starts to freeze our jms 
listener threads and every thread reaches a choked up state in sometime.

Due to which our read services which does not even use txn to retrieve data 
also starts to choke. Ultimately leading to end user traffic dip.

We were hoping product upgrade will help but that has not been the case till 
now. 

 

 

 

 

 

 


> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to the issue observed in 
> production environment where cluster would go in hang state due to partition 
> map exchange.
> Please find the below ticket which i created a while back for ignite 2.7.6
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated the apache ignite version to 2.14 and upgrade happened 
> smoothly but on the third day we could see cluster traffic dip again. 
> We have 4 nodes in a cluster where we provide 400 GB of RAM and more than 1 
> TB HDD.
> PFB for the attached config.[I have added it as attachment for review]
> I have also added the server logs from the same time when issue happened.
> We have set txn timeout as well as socket timeout both at server and client 
> end for our write operations but seems like sometimes cluster goes into hang 
> state and all our get calls are stuck and slowly everything starts to freeze 
> our jms listener threads and every thread reaches a choked up state in 
> sometime.
> Due to which our read services which does not even use txn to retrieve data 
> also starts to choke. Ultimately leading to end user traffic dip.
> We were hoping product upgrade will help but that has not been the case till 
> now. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795614#comment-17795614
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi, please review and comment, and let me know if more info is needed.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: cache-config-1.xml, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2
>
>
> We recently upgraded from 2.7.6 to 2.14 because of an issue observed in our 
> production environment where the cluster would go into a hang state during 
> partition map exchange.
> Please see the ticket below, which I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> We migrated to Apache Ignite 2.14 and the upgrade itself went smoothly, but 
> on the third day we again saw a dip in cluster traffic.
> We have a 4-node cluster with 400 GB of RAM and more than 1 TB of HDD.
> Please find the configuration attached for review.
> We have set a transaction timeout as well as a socket timeout, on both the 
> server and the client side, for our write operations, but the cluster still 
> sometimes goes into a hang state: all our get calls get stuck, our JMS 
> listener threads gradually freeze, and after a while every thread is choked.
> As a result, our read services, which do not even use transactions to 
> retrieve data, also start to choke, ultimately causing a dip in end-user 
> traffic.
> We were hoping the product upgrade would help, but that has not been the 
> case so far.
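The socket timeout mentioned above is typically applied on the communication SPI. A minimal sketch, assuming TcpCommunicationSpi is in use, with illustrative timeout values rather than the reporter's actual ones:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class SocketTimeoutSketch {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Close the connection if a message cannot be written to the socket
        // within 10 s (illustrative value).
        commSpi.setSocketWriteTimeout(10_000);

        // Give up establishing an outgoing connection after 5 s (illustrative value).
        commSpi.setConnectTimeout(5_000);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setCommunicationSpi(commSpi);

        try (Ignite ignite = Ignition.start(cfg)) {
            // The same SPI settings can be applied on both server and thick-client nodes.
        }
    }
}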



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)
Vipul Thakur created IGNITE-21059:
-

 Summary: We have upgraded our ignite instance from 2.7.6 to 2.14. 
Found long running cache operations
 Key: IGNITE-21059
 URL: https://issues.apache.org/jira/browse/IGNITE-21059
 Project: Ignite
  Issue Type: Bug
  Components: binary, clients
Affects Versions: 2.14
Reporter: Vipul Thakur
 Attachments: cache-config-1.xml, 
digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2

We recently upgraded from 2.7.6 to 2.14 because of an issue observed in our 
production environment where the cluster would go into a hang state during 
partition map exchange.

Please see the ticket below, which I created a while back for Ignite 2.7.6:

https://issues.apache.org/jira/browse/IGNITE-13298

We migrated to Apache Ignite 2.14 and the upgrade itself went smoothly, but on 
the third day we again saw a dip in cluster traffic.

We have a 4-node cluster with 400 GB of RAM and more than 1 TB of HDD.

Please find the configuration attached for review.

We have set a transaction timeout as well as a socket timeout, on both the 
server and the client side, for our write operations, but the cluster still 
sometimes goes into a hang state: all our get calls get stuck, our JMS 
listener threads gradually freeze, and after a while every thread is choked.

As a result, our read services, which do not even use transactions to retrieve 
data, also start to choke, ultimately causing a dip in end-user traffic.

We were hoping the product upgrade would help, but that has not been the case 
so far.
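As a rough illustration of the access pattern described above (transactional writes with an explicit timeout, transaction-free reads), here is a sketch using a hypothetical cache name and an illustrative timeout; it is not the reporter's code:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class WriteReadSketch {
    // Write path: explicit transaction with a 10 s timeout (illustrative value).
    static void write(Ignite ignite, long key, String value) {
        IgniteCache<Long, String> cache = ignite.cache("eventCache"); // hypothetical cache name
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.READ_COMMITTED,
                10_000,  // timeout, ms
                1)) {    // expected number of entries touched by the tx
            cache.put(key, value);
            tx.commit();
        }
    }

    // Read path: a plain get, no transaction involved.
    static String read(Ignite ignite, long key) {
        IgniteCache<Long, String> cache = ignite.cache("eventCache");
        return cache.get(key);
    }
}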

 

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-6894) Hanged Tx monitoring

2020-07-28 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166370#comment-17166370
 ] 

Vipul Thakur commented on IGNITE-6894:
--

Is this resolved in any released version? We are facing this issue.

> Hanged Tx monitoring
> 
>
> Key: IGNITE-6894
> URL: https://issues.apache.org/jira/browse/IGNITE-6894
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-7
>
> Hanging Transactions Not Related to Deadlock
> Description
>  This situation can occur if the user explicitly demarcates the transaction 
> (especially Pessimistic Repeatable Read) and, for example, calls a remote 
> service (which may be unresponsive) after acquiring some locks. All other 
> transactions depending on the same keys will hang.
> Detection and Solution
>  This most likely cannot be resolved automatically other than by rolling 
> back the transaction on timeout and releasing all the locks acquired so 
> far. Such transactions can also be rolled back from Web Console as 
> described above.
>  If a transaction has been rolled back on timeout or via the UI, then any 
> further action in the transaction, e.g. lock acquisition or a commit 
> attempt, should throw an exception.
> Report
> Management tools (e.g. Web Console) should provide the ability to roll 
> back any transaction via the UI.
>  Long-running transactions should be reported to the logs. The log record 
> should contain: near nodes, transaction IDs, cache names, keys (limited to 
> several tens), etc.
> Also there should be a screen in Web Console listing all ongoing 
> transactions in the cluster, including the info above.
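To make the behaviour described above concrete: with an explicit timeout on a pessimistic REPEATABLE_READ transaction, a hung transaction is rolled back once the timeout expires, its locks are released, and any further action on it fails with an exception. A sketch with a hypothetical cache name and an illustrative timeout; the exact exception wrapping can vary by code path:

import javax.cache.CacheException;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionTimeoutException;

public class HungTxSketch {
    static void updateWithTimeout(Ignite ignite, long key) {
        IgniteCache<Long, String> cache = ignite.cache("exampleCache"); // hypothetical cache name

        // A label makes the transaction easy to identify in logs and management tooling.
        try (Transaction tx = ignite.transactions().withLabel("order-update").txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                5_000,   // timeout, ms: the tx is rolled back if it runs longer than this
                0)) {    // expected number of entries (0 = unknown)
            cache.put(key, "value");   // acquires the lock on the key
            // ... a slow or unresponsive remote call here would keep the lock held ...
            tx.commit();
        }
        catch (TransactionTimeoutException | CacheException e) {
            // Once the timeout has rolled the transaction back, further actions on it
            // (lock acquisition, commit) fail and the locks it held are released.
            System.err.println("Transaction rolled back on timeout: " + e.getMessage());
        }
    }
}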



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IGNITE-13298) Found long running cache at client end

2020-07-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164776#comment-17164776
 ] 

Vipul Thakur commented on IGNITE-13298:
---

The cluster memory/persistence configuration is in the Environment section at the top.
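Since the XML markup of that Environment section did not survive the archive, the following is only a hedged Java approximation of what the storage-related property names still visible in it suggest (checkpointPageBufferSize, storagePath, walPath, walArchivePath, LOG_ONLY WAL mode, pageSize, metricsEnabled). Every concrete value below is an illustrative placeholder, not the reporter's setting:

import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class StorageConfigSketch {
    // Assumed reconstruction of a persistence setup; paths and sizes are placeholders.
    public static IgniteConfiguration storageConfig() {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default")
            .setPersistenceEnabled(true)
            .setCheckpointPageBufferSize(2L * 1024 * 1024 * 1024)  // placeholder: 2 GB
            .setMetricsEnabled(true);

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region)
            .setStoragePath("/data/ignite/storage")        // placeholder paths
            .setWalPath("/data/ignite/wal")
            .setWalArchivePath("/data/ignite/wal-archive")
            .setWalMode(WALMode.LOG_ONLY)
            .setPageSize(4096);                             // placeholder page size

        return new IgniteConfiguration().setDataStorageConfiguration(storage);
    }
}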

> Found long running cache at client end 
> ---
>
> Key: IGNITE-13298
> URL: https://issues.apache.org/jira/browse/IGNITE-13298
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7.6
> Environment: cluster memory config/persistence: Ignite Spring XML 
> configuration (element markup garbled in the archive). The settings still 
> recognizable in it are a Log4J2Logger bean pointing at 
> ${IGNITE_SCRIPT}/ignite-log4j2.xml and a persistence/data storage 
> configuration referencing ${checkpointPageBufferSize}, ${storagePath}, 
> ${walPath}, ${walArchivePath}, a LOG_ONLY WAL mode, ${pageSize}, and 
> metricsEnabled=true.
> ==Client thread dump ===
> 2020-07-20 12:14:43
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):
>
> "Attach Listener" #788 daemon prio=9 os_prio=0 tid=0x7fe7f4001000 nid=0x32d waiting on condition [0x]
>    java.lang.Thread.State: RUNNABLE
>    Locked ownable synchronizers: - None
>
> "Context_6_jms_314_ConsumerDispatcher" #787 daemon prio=5 os_prio=0 tid=0x7fe6e805e000 nid=0x31a waiting on condition [0x7fe2e5bdd000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0xcb87d9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
>       at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.eventLoop(ConsumerNotificationDispatcher.java:110)
>       at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.run(ConsumerNotificationDispatcher.java:130)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "DefaultMessageListenerContainer-35" #786 prio=5 os_prio=0 tid=0x7fe460013800 nid=0x319 in Object.wait() [0x7fe2e5cde000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       at com.solacesystems.jcsmp.impl.XMLMessageQueue.dequeue(XMLMessageQueue.java:130)
>       at com.solacesystems.jcsmp.impl.flow.FlowHandleImpl.receive(FlowHandleImpl.java:845)
>       - locked <0xcb8cce50> (a com.solacesystems.jcsmp.impl.XMLMessageQueueList)
>       at com.solacesystems.jms.SolMessageConsumer.receive(SolMessageConsumer.java:253)
>       at org.springframework.jms.connection.CachedMessageConsumer.receive(CachedMessageConsumer.java:86)
>       at org.springframework.jms.support.destination.JmsDestinationAccessor.receiveFromConsumer(JmsDestinationAccessor.java:132)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveMessage(AbstractPollingMessageListenerContainer.java:418)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:303)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1189)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1179)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1076)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "Context_4_jms_313_ConsumerDispatcher" #785 daemon prio=5 os_prio=0 tid=0x7fe6f8028000 nid=0x318 waiting on condition [0x7fe2e5ddf000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0xcb8cf8d0> (a

[jira] [Updated] (IGNITE-13298) Found long running cache at client end

2020-07-25 Thread Vipul Thakur (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vipul Thakur updated IGNITE-13298:
--
Issue Type: Task  (was: Bug)
  Priority: Blocker  (was: Major)

> Found long running cache at client end 
> ---
>
> Key: IGNITE-13298
> URL: https://issues.apache.org/jira/browse/IGNITE-13298
> Project: Ignite
>  Issue Type: Task
>Affects Versions: 2.7.6
> Environment: cluster memory config/persistence: Ignite Spring XML 
> configuration (element markup garbled in the archive). The settings still 
> recognizable in it are a Log4J2Logger bean pointing at 
> ${IGNITE_SCRIPT}/ignite-log4j2.xml and a persistence/data storage 
> configuration referencing ${checkpointPageBufferSize}, ${storagePath}, 
> ${walPath}, ${walArchivePath}, a LOG_ONLY WAL mode, ${pageSize}, and 
> metricsEnabled=true.
> ==Client thread dump ===
> 2020-07-20 12:14:43
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):
>
> "Attach Listener" #788 daemon prio=9 os_prio=0 tid=0x7fe7f4001000 nid=0x32d waiting on condition [0x]
>    java.lang.Thread.State: RUNNABLE
>    Locked ownable synchronizers: - None
>
> "Context_6_jms_314_ConsumerDispatcher" #787 daemon prio=5 os_prio=0 tid=0x7fe6e805e000 nid=0x31a waiting on condition [0x7fe2e5bdd000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0xcb87d9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
>       at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.eventLoop(ConsumerNotificationDispatcher.java:110)
>       at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.run(ConsumerNotificationDispatcher.java:130)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "DefaultMessageListenerContainer-35" #786 prio=5 os_prio=0 tid=0x7fe460013800 nid=0x319 in Object.wait() [0x7fe2e5cde000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       at com.solacesystems.jcsmp.impl.XMLMessageQueue.dequeue(XMLMessageQueue.java:130)
>       at com.solacesystems.jcsmp.impl.flow.FlowHandleImpl.receive(FlowHandleImpl.java:845)
>       - locked <0xcb8cce50> (a com.solacesystems.jcsmp.impl.XMLMessageQueueList)
>       at com.solacesystems.jms.SolMessageConsumer.receive(SolMessageConsumer.java:253)
>       at org.springframework.jms.connection.CachedMessageConsumer.receive(CachedMessageConsumer.java:86)
>       at org.springframework.jms.support.destination.JmsDestinationAccessor.receiveFromConsumer(JmsDestinationAccessor.java:132)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveMessage(AbstractPollingMessageListenerContainer.java:418)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:303)
>       at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1189)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1179)
>       at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1076)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers: - None
>
> "Context_4_jms_313_ConsumerDispatcher" #785 daemon prio=5 os_prio=0 tid=0x7fe6f8028000 nid=0x318 waiting on condition [0x7fe2e5ddf000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <0xcb8cf8d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at

[jira] [Created] (IGNITE-13298) Found long running cache at client end

2020-07-25 Thread Vipul Thakur (Jira)
Vipul Thakur created IGNITE-13298:
-

 Summary: Found long running cache at client end 
 Key: IGNITE-13298
 URL: https://issues.apache.org/jira/browse/IGNITE-13298
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7.6
 Environment: cluster memory config/persistence: Ignite Spring XML 
configuration (XML markup lost in the archive; the settings still recognizable 
are listed in the quoted copies of this description above).

==Client thread dump ===
2020-07-20 12:14:43
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):

"Attach Listener" #788 daemon prio=9 os_prio=0 tid=0x7fe7f4001000 nid=0x32d waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE
   Locked ownable synchronizers: - None

"Context_6_jms_314_ConsumerDispatcher" #787 daemon prio=5 os_prio=0 tid=0x7fe6e805e000 nid=0x31a waiting on condition [0x7fe2e5bdd000]
   java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0xcb87d9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
      at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.eventLoop(ConsumerNotificationDispatcher.java:110)
      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.run(ConsumerNotificationDispatcher.java:130)
      at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers: - None

"DefaultMessageListenerContainer-35" #786 prio=5 os_prio=0 tid=0x7fe460013800 nid=0x319 in Object.wait() [0x7fe2e5cde000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at com.solacesystems.jcsmp.impl.XMLMessageQueue.dequeue(XMLMessageQueue.java:130)
      at com.solacesystems.jcsmp.impl.flow.FlowHandleImpl.receive(FlowHandleImpl.java:845)
      - locked <0xcb8cce50> (a com.solacesystems.jcsmp.impl.XMLMessageQueueList)
      at com.solacesystems.jms.SolMessageConsumer.receive(SolMessageConsumer.java:253)
      at org.springframework.jms.connection.CachedMessageConsumer.receive(CachedMessageConsumer.java:86)
      at org.springframework.jms.support.destination.JmsDestinationAccessor.receiveFromConsumer(JmsDestinationAccessor.java:132)
      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveMessage(AbstractPollingMessageListenerContainer.java:418)
      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:303)
      at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:257)
      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1189)
      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1179)
      at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1076)
      at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers: - None

"Context_4_jms_313_ConsumerDispatcher" #785 daemon prio=5 os_prio=0 tid=0x7fe6f8028000 nid=0x318 waiting on condition [0x7fe2e5ddf000]
   java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0xcb8cf8d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
      at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.eventLoop(ConsumerNotificationDispatcher.java:110)
      at com.solacesystems.jcsmp.protocol.nio.impl.ConsumerNotificationDispatcher.run(ConsumerNotificationDispatcher.java:130)
      at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers: - None

"DefaultMessageListenerContainer-27" #784 prio=5 os_prio=0