[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806067#comment-17806067
 ] 

Vipul Thakur commented on IGNITE-21059:
---

As I said, all the nodes are in the same data center and we don't have any 
restrictions in terms of connectivity. Could it be a network fluctuation? Is 
there any way to benchmark this with Ignite? We also think it is happening due 
to the network, but we don't have a way to prove it to our n/w team.
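
One rough way to put numbers in front of the network team is to time a trivial compute 
broadcast from a client node and watch for outliers. A minimal sketch, assuming peer class 
loading is enabled and a hypothetical ignite-client.xml client configuration:

{code:java}
import java.util.Collection;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteCallable;

public class ClusterPingProbe {
    public static void main(String[] args) {
        // Hypothetical client config pointing at the same discovery addresses as production.
        // Requires peer class loading to be enabled, or this class deployed on the servers.
        try (Ignite ignite = Ignition.start("ignite-client.xml")) {
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();

                // One round trip to every server node; a consistent outlier hints at a network issue.
                Collection<String> replies = ignite.compute().broadcast((IgniteCallable<String>)() -> "pong");

                System.out.printf("broadcast #%d: %d nodes, %d us%n",
                    i, replies.size(), (System.nanoTime() - start) / 1_000);
            }
        }
    }
}
{code}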

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> 
>
> Key: IGNITE-21059
> URL: https://issues.apache.org/jira/browse/IGNITE-21059
> Project: Ignite
>  Issue Type: Bug
>  Components: binary, clients
>Affects Versions: 2.14
>Reporter: Vipul Thakur
>Priority: Critical
> Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup-1.out, ignite-server-nohup.out, ignite_issue_1101.zip, 
> image-2024-01-11-22-28-51-501.png, image.png, long_txn_.png, nohup_12.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to an issue observed in our 
> production environment where the cluster would go into a hang state due to 
> partition map exchange.
> Please find below the ticket I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> We migrated Apache Ignite to 2.14 and the upgrade went smoothly, but on the 
> third day we saw the cluster traffic dip again.
> We have 5 nodes in the cluster, each with 400 GB of RAM and more than 1 TB of 
> SSD. The configuration is attached for review, and I have also added the 
> server logs from the time the issue happened.
> We have set a transaction timeout as well as a socket timeout, both at the 
> server and the client end, for our write operations, but it seems that 
> sometimes the cluster goes into a hang state: all our get calls are stuck, 
> everything slowly starts to freeze our JMS listener threads, and every thread 
> reaches a choked-up state after some time.
> Because of this, our read services, which do not even use transactions to 
> retrieve data, also start to choke, ultimately leading to an end-user traffic 
> dip.
> We were hoping the product upgrade would help, but that has not been the case 
> so far.
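
For reference, a minimal Java sketch of the kind of timeouts described above; the concrete 
values are placeholders, not the reporter's actual settings:

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.TransactionConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class TimeoutConfigSketch {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Default timeout applied to transactions that do not set one explicitly.
        TransactionConfiguration txCfg = new TransactionConfiguration();
        txCfg.setDefaultTxTimeout(30_000); // placeholder value, ms
        cfg.setTransactionConfiguration(txCfg);

        // Fail writes on a stuck TCP connection instead of blocking indefinitely.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(3_000); // placeholder value, ms
        cfg.setCommunicationSpi(commSpi);

        return cfg;
    }
}
{code}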





[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806049#comment-17806049
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

All the logs are spammed with:

{noformat}
[2024-01-11T21:47:48,434][ERROR][client-connector-#3179%EVENT_PROCESSING%][TcpCommunicationSpi]
 Failed to send message to remote node [node=TcpDiscoveryNode 
[id=6f3d624b-a7dd-46e1-b229-db8a85cf8ece, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList 
[10.135.34.67, 127.0.0.1], sockAddrs=HashSet [/10.135.34.67:47500, 
/127.0.0.1:47500], discPort=47500, order=13, intOrder=9, 
lastExchangeTime=1704713028652, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false], msg=GridIoMessage [plc=10, topic=TOPIC_QUERY, topicOrd=19, 
ordered=false, timeout=0, skipOnTimeout=false, msg=GridQueryCancelRequest 
[qryReqId=33]]]
org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to 
send message (node left topology): TcpDiscoveryNode 
[id=6f3d624b-a7dd-46e1-b229-db8a85cf8ece, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList 
[10.135.34.67, 127.0.0.1], sockAddrs=HashSet [/10.135.34.67:47500, 
/127.0.0.1:47500], discPort=47500, order=13, intOrder=9, 
lastExchangeTime=1704713028652, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]
{noformat}

Is this normal?




[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806047#comment-17806047
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

digitalapi-eventprocessing-service-app-64cd96f9c-k5nb2 and node1 have different 
timestamps, so it's hard to pinpoint the problem, also because of the numerous 
disconnect exceptions. Can you highlight the problematic timestamp interval?



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805949#comment-17805949
 ] 

Vipul Thakur commented on IGNITE-21059:
---

All of them are in the same data center, there is no firewall, and all the 
required ports are open as well. It happened when I stopped a node and 
restarted it.

 



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805890#comment-17805890
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Numerous occurrences of:

{noformat}
Failed to connect to node (is node still alive?). Make sure that each 
ComputeTask and cache Transaction has a timeout set in order to prevent parties 
from waiting forever in case of network issues 
[nodeId=08ee4a70-3342-4e3b-abd4-3999bbea57e4, addrs=[/10.244.11.201:47100, 
/127.0.0.1:47100, 0:0:0:0:0:0:0:1%lo:47100]]
at 
org.apache.ignite.spi.communication.tcp.internal.GridNioServerWrapper.createNioSession(GridNioServerWrapper.java:565)
 ~[ignite-core-2.14.0.jar:2.14.0]
{noformat}

Network problems?


{noformat}
Failed to connect to node (is node still alive?). Make sure that each 
ComputeTask and cache Transaction has a timeout set in order to prevent parties 
from waiting forever in case of network issues 
[nodeId=08ee4a70-3342-4e3b-abd4-3999bbea57e4, addrs=[/10.244.11.201:47100, 
/127.0.0.1:47100, 0:0:0:0:0:0:0:1%lo:47100]]

>> Selector info [id=0, keysCnt=2, bytesRcvd=86003560, bytesRcvd0=854, 
>> bytesSent=280419496, bytesSent0=0]
Connection info [in=true, rmtAddr=/10.135.34.53:36796, 
locAddr=/10.135.34.68:47100
{noformat}
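
Since the message above asks for ComputeTask/transaction timeouts and hints at network 
issues, the failure detection pair on IgniteConfiguration is worth double-checking as well. 
A hedged sketch with illustrative values:

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureDetectionSketch {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long a server node may stay unresponsive before it is dropped; default is 10 s.
        cfg.setFailureDetectionTimeout(10_000);       // illustrative value, ms

        // Same idea for client nodes; default is 30 s.
        cfg.setClientFailureDetectionTimeout(30_000); // illustrative value, ms

        return cfg;
    }
}
{code}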





[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805720#comment-17805720
 ] 

Vipul Thakur commented on IGNITE-21059:
---

I also ran 

*{{control.sh|bat --cache contention 5}}*

*OUTPUT*

JVM_OPTS environment variable is set, but will not be used. To pass JVM options 
use CONTROL_JVM_OPTS

JVM_OPTS=-Xms1g -Xmx1g -XX:+AlwaysPreTouch -Djava.net.preferIPv4Stack=true

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 


INFO: Client TCP connection established: localhost/127.0.0.1:11211

2024-01-11T22:40:23,579][INFO 
][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:41264, 
rmtAddr=/10.135.34.68:47100]

2024-01-11T22:40:23,594][INFO 
][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] 
Established outgoing communication connection [locAddr=/10.135.34.53:56674, 
rmtAddr=/10.135.34.67:47100]

Jan 11, 2024 10:40:23 PM 
org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection 
close

INFO: Client TCP connection closed: localhost/127.0.0.1:11211

Jan 11, 2024 10:40:23 PM org.apache.ignite.internal.client.util.GridClientUtils 
shutdownNow

WARNING: Runnable tasks outlived thread pool executor service 
[owner=GridClientConnectionManager, 
tasks=[java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53f65459]]

[node=TcpDiscoveryNode [id=acfd7965-2d2a-498f-aa89-a57da5208cb4, 
consistentId=c67390a7-9746-445b-9f40-b98ea32cc1ed, addrs=ArrayList 
[10.135.34.67, 127.0.0.1], sockAddrs=null, discPort=47500, order=90, 
intOrder=48, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode [id=3f5fc804-95f7-4151-809c-ad52c0528806, 
consistentId=3204dd77-8571-4c06-a059-aaf2ec06b739, addrs=ArrayList 
[10.135.34.53, 127.0.0.1], sockAddrs=null, discPort=47500, order=88, 
intOrder=47, lastExchangeTime=1704993022880, loc=false, 
ver=2.14.0#20220929-sha1:951e8deb, isClient=false]]

[node=TcpDiscoveryNode [id=855b22e7-0ad7-4521-ab53-3af65b6fce73, 
consistentId=ee70a820-92a5-48c7-a5da-4965c946b550, addrs=ArrayList 
[10.135.34.68, 127.0.0.1], sockAddrs=null, discPort=47500, order=4, intOrder=4, 
lastExchangeTime=1704993022880, loc=false, ver=2.14.0#20220929-sha1:951e8deb, 
isClient=false]]

Control utility [ver. 2.14.0#20220929-sha1:951e8deb]

2022 Copyright(C) Apache Software Foundation

Time: 2024-01-11T22:40:22.947

Command [CACHE] started

Arguments: --host localhost --port 11211 --user  --password * --cache 
contention 5 



Command [CACHE] finished with code: 0

Control utility has completed execution at: 2024-01-11T22:40:23.734

Execution time: 787 ms


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805709#comment-17805709
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan]  | [~cos] 

 

I ran a test again today in my local environment, changing the transaction 
concurrency to optimistic and the isolation level to serializable with a 5 s 
transaction timeout, and ran a long-running load with low traffic. We have 
multiple JMS listeners which communicate with Ignite while writing data, and 
during the load I restarted one node to mimic a change in the network topology 
of the cluster. The first time I did this nothing happened, but when I did it 
again with another node we observed the same issue as in production.

The write services' listeners went into a choked state and my queue started 
piling up.

[^ignite_issue_1101.zip]

The zip contains the thread dump of the service, the logs of the pod, and the 
logs from all 3 nodes in that environment.

We have increased the WAL size to 512 MB, reduced the transaction timeout to 
5 seconds, and rolled back failureDetectionTimeout and 
clientFailureDetectionTimeout to their default values.

Please help us with your observations.
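
For clarity, a minimal sketch of the transaction settings used in this test (OPTIMISTIC / 
SERIALIZABLE with a 5 s timeout); the cache name is a placeholder:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class OptimisticWriteSketch {
    // One write under OPTIMISTIC / SERIALIZABLE with a 5 s timeout, as described above.
    static void write(Ignite ignite, String key, String value) {
        IgniteCache<String, String> cache = ignite.cache("eventCache"); // placeholder cache name

        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.OPTIMISTIC,
                TransactionIsolation.SERIALIZABLE,
                5_000, // timeout, ms
                0)) {  // txSize hint, 0 = unknown
            cache.put(key, value);
            tx.commit(); // may throw TransactionOptimisticException on a serialization conflict
        }
    }
}
{code}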



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-02 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802018#comment-17802018
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Starvation in such a case means that no progress is detected in the mentioned 
pool. I still can't see the root problem (I checked the logs), but of course 
reducing the number of processing clients can help. It also still seems you may 
have deadlocks: check [1] and reduce the deadlock detection timeout from the 
default of 1 minute to 10 seconds.

[1] 
https://ignite.apache.org/docs/latest/key-value-api/transactions#deadlock-detection
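
The deadlock detection described in [1] is triggered when a transaction hits its timeout, so 
one practical reading of the advice is to run the suspect writes with a short explicit 
timeout and inspect the failure for a deadlock report. A hedged sketch; the cache name and 
the 10 s value are placeholders:

{code:java}
import javax.cache.CacheException;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionDeadlockException;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionTimeoutException;

public class DeadlockProbeSketch {
    // Runs one write with a short explicit timeout so deadlock detection kicks in promptly,
    // then walks the cause chain looking for the deadlock report.
    static void tryWrite(Ignite ignite, String key, String value) {
        IgniteCache<String, String> cache = ignite.cache("eventCache"); // placeholder cache name

        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ,
                10_000, 0)) {
            cache.put(key, value);
            tx.commit();
        }
        catch (CacheException e) {
            for (Throwable t = e; t != null; t = t.getCause()) {
                if (t instanceof TransactionDeadlockException)
                    System.err.println("Deadlock detected:\n" + t.getMessage());
                else if (t instanceof TransactionTimeoutException)
                    System.err.println("Transaction timed out: " + t.getMessage());
            }
        }
    }
}
{code}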



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2024-01-01 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801646#comment-17801646
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan] 

Thank you for the observation.

 

We have also observed a new warning related to the striped pool:

 

2023-12-29 16:41:09.426 ERROR 1 --- [api.endpoint-22] 
b.b.EventProcessingErrorHandlerJmsSender : >>> Published error 
message ..EventProcessingErrorHandlerJmsSender ..
*2023-12-29 16:41:09.569  WARN 1 --- [85b8d7f7-ntw27%] 
o.a.i.i.processors.pool.PoolProcessor    : >>> Possible starvation in striped 
pool.*
    *Thread name: 
sys-stripe-0-#1%DIGITALAPI__PRIMARY_digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27%*
    Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_TX, 
topicOrd=20, ordered=false, timeout=0, skipOnTimeout=false, msg=TxLocksResponse 
[futId=2236, nearTxKeyLocks=HashMap {}, txKeys=null]]], Message closure 
[msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, 
timeout=0, skipOnTimeout=false, msg=GridNearLockResponse [pending=ArrayList [], 
miniId=1, dhtVers=GridCacheVersion[] [GridCacheVersion [topVer=312674347, 
order=1703970204663, nodeOrder=2, dataCenterId=0]], 
mappedVers=GridCacheVersion[] [GridCacheVersion [topVer=315266949, 
order=1703839756326, nodeOrder=2, dataCenterId=0]], clientRemapVer=null, 
compatibleRemapVer=false, super=GridDistributedLockResponse 
[futId=b9a9f75bc81-870cf83b-d2dd-4aa0-9d9f-bffdb8d46b1a, err=null, 
vals=ArrayList [BinaryObjectImpl [arr= true, ctx=false, start=0]], 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=315266949, 
order=1703839751829, nodeOrder=11, dataCenterId=0], commit

Please find the detailed logs below:

[^digiapi-eventprocessing-app-zone1-6685b8d7f7-ntw27.log]

 

Could it be that our write services are getting affected because we have too 
many read clients?

Should we try to decrease the number of read services?



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-30 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801313#comment-17801313
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

1 GB is enough, and the message you mention is correct behavior.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-29 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801194#comment-17801194
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan] 

 

Today we got another issue in production : 


2023-12-29T03:13:47,467][INFO 
][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager]
 *Starting to clean WAL archive [highIdx=8303528, currSize=512.0 MB, 
maxSize=1.0 GB]*
2023-12-29T03:13:47,468][INFO 
][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager]
 Finish clean WAL archive [cleanCnt=1, currSize=448.0 MB, maxSize=1.0 GB]
2023-12-29T03:13:47,563][INFO 
][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][FileWriteAheadLogManager]
 Copied file 
[src=/datastore2/wal/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/0005.wal,
 
dst=/datastore2/archive/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/08303535.wal]
2023-12-29T03:14:17,080][INFO 
][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][Fil

 

In the above log it seems the WAL archive is also filling up fast.
Should we also set maxWalArchiveSize to a value higher than the default 1 GB?

Below are the logs from one of our nodes; the same can be seen on all the nodes:

[^nohup_12.out]

Please help us with your observations.
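
If raising the limit is the route taken, the knob lives on DataStorageConfiguration. A 
hedged sketch with illustrative values, not a sizing recommendation for this cluster:

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalArchiveSketch {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        DataStorageConfiguration ds = new DataStorageConfiguration();

        // The log above shows maxSize=1.0 GB; 10 GB here is only an illustrative value.
        ds.setMaxWalArchiveSize(10L * 1024 * 1024 * 1024);

        // 512 MB segments, matching the WAL size mentioned earlier in the thread (illustrative).
        ds.setWalSegmentSize(512 * 1024 * 1024);

        cfg.setDataStorageConfiguration(ds);
        return cfg;
    }
}
{code}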



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800933#comment-17800933
 ] 

Vipul Thakur commented on IGNITE-21059:
---

!image.png!

 

Yes, the CPU count on one of the physical nodes is 160.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800931#comment-17800931
 ] 

Vipul Thakur commented on IGNITE-21059:
---

I will get the exact value. As per the docs it is calculated as {{max(8, total 
number of cores)}}; I will ask my team to check it, and we will also monitor 
pool usage.

Still, I am not sure why the threads are stuck even after they have timed out 
on the client end.
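
For context, the striped (and system) pool sizes can also be pinned explicitly instead of 
relying on the max(8, cores) default. A hedged sketch with purely illustrative values:

{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;

public class StripedPoolSketch {
    public static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Default striped pool size is max(8, CPU cores); pin it explicitly if the
        // 160-core default is larger than the node actually needs.
        cfg.setStripedPoolSize(32);       // illustrative value

        // The system pool follows a similar default and can be tuned the same way.
        cfg.setSystemThreadPoolSize(32);  // illustrative value

        return cfg;
    }
}
{code}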



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-27 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800922#comment-17800922
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Do you really have 160 CPU(s) per physical node?
You need to check your pool usage (through JMX monitoring).
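
A rough sketch of the kind of JMX check meant here; it uses a wildcard query because the 
exact MBean object names vary between versions and setups, and simply dumps whatever Ignite 
thread-pool beans it finds when run inside (or attached to) the node's JVM:

{code:java}
import java.lang.management.ManagementFactory;

import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class PoolUsageDump {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Ignite registers its beans under the "org.apache" domain; the full object-name
        // layout differs between versions, hence the wildcard and the name filter.
        for (ObjectName name : srv.queryNames(new ObjectName("org.apache:*"), null)) {
            String s = name.toString();

            if (!s.contains("Thread Pools") && !s.contains("StripedExecutor"))
                continue;

            System.out.println(s);

            for (MBeanAttributeInfo attr : srv.getMBeanInfo(name).getAttributes()) {
                try {
                    System.out.println("  " + attr.getName() + " = " + srv.getAttribute(name, attr.getName()));
                }
                catch (Exception ignored) {
                    // Some attributes are not readable; skip them.
                }
            }
        }
    }
}
{code}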



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800425#comment-17800425
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan],

*Please find below another such scenario from the log file 
[^ignite-server-nohup.out]. These are the same kind of logs; if you search for 
this entry in that file you will get more context to observe.*

[2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
threadId=567, futId=13db6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641522, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55445052, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=5f4b66f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333873909, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641522, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334276938, duration=30003]]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1798)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter$PostLockClosure1.apply(IgniteTxLocalAdapter.java:1746)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$2.applyx(GridEmbeddedFuture.java:86)
 ~[ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:292)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridEmbeddedFuture$AsyncListener1.apply(GridEmbeddedFuture.java:285)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:464)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:348)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:336)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:576)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.GridCacheCompoundIdentityFuture.onDone(GridCacheCompoundIdentityFuture.java:56)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:555)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onComplete(GridDhtLockFuture.java:807)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.access$900(GridDhtLockFuture.java:93)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture$LockTimeoutObject.onTimeout(GridDhtLockFuture.java:1207)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:234)
 [ignite-core-2.14.0.jar:2.14.0]
    at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) 
[ignite-core-2.14.0.jar:2.14.0]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_351]
2023-11-30T14:21:46,945][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, 
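
One way to confirm which transactions are still alive on a node after the clients have given 
up is to list the local active transactions (control.sh also has a --tx command for the same 
purpose). A hedged sketch, assuming it runs in code deployed on the server node:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.transactions.Transaction;

public class LongTxReport {
    // Prints local transactions that have been running longer than the given threshold.
    static void report(Ignite ignite, long thresholdMs) {
        long now = System.currentTimeMillis();

        for (Transaction tx : ignite.transactions().localActiveTransactions()) {
            long ageMs = now - tx.startTime();

            if (ageMs > thresholdMs) {
                System.out.printf("xid=%s label=%s state=%s isolation=%s age=%d ms%n",
                    tx.xid(), tx.label(), tx.state(), tx.isolation(), ageMs);

                // tx.rollback(); // optionally force-roll back transactions stuck on the server
            }
        }
    }
}
{code}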

[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800422#comment-17800422
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

I'm afraid I can't help based on this single picture.
You can attach an additional label to the transaction via 
org.apache.ignite.IgniteTransactions#withLabel; this may help narrow down the 
problem.
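
A minimal sketch of the labeling being suggested, so that server-side transaction 
diagnostics carry an identifiable name; the label text and cache name are placeholders:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class LabeledTxSketch {
    static void write(Ignite ignite, String key, String value) {
        IgniteCache<String, String> cache = ignite.cache("eventCache"); // placeholder cache name

        // The label shows up in server-side transaction diagnostics, which makes it easier
        // to tell which listener/service started a long-running transaction.
        try (Transaction tx = ignite.transactions()
                .withLabel("jms-writer-zone1") // placeholder label
                .txStart(TransactionConcurrency.PESSIMISTIC,
                         TransactionIsolation.REPEATABLE_READ,
                         30_000, 0)) {
            cache.put(key, value);
            tx.commit();
        }
    }
}
{code}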



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-25 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800410#comment-17800410
 ] 

Vipul Thakur commented on IGNITE-21059:
---

* !long_txn_.png!

Hi 

 

[~zstan]  | [~cos]

Even after the client pods time out after 30 seconds, we can observe in the 
server logs that transactions keep running for much longer: the start time was 
around 14:06 and the log entry was printed at 14:16.

Please help with your observations.

 



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796799#comment-17796799
 ] 

Vipul Thakur commented on IGNITE-21059:
---

One of the JMS listeners was receiving more load than the rest of the listeners. From 
the frequent logs about the WAL being moved to disk, what I understand is that this is 
causing the issue: while the data is being moved, another write request arrives for the 
same entity, which is already busy being written to disk.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796797#comment-17796797
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Is there a big load in such a case? Some anomaly, probably?



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796795#comment-17796795
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

OK, I took a bit more of a look, but:
1. As far as I can see, most of the problems come from 4 nodes, one of them being the 
x.244.6.80 node. You can grep for it: grep 'Failed to acquire lock within provided timeout' 
nohup_26.out and check nodeOrder=X, where X is the order of the node; after grepping for 
this order you can find the node that initiated the tx.
2. Transactions can't take a lock (I still can't see the reason), but as far as I can 
see all the transactions are rolled back.
Maybe you have some monitoring for these nodes?
You can reduce the tx timeout and simply rerun the transaction after a rollback, or use 
optimistic transactions as I already wrote (a sketch of the retry loop is below).
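
For reference, a minimal sketch of the "shorter timeout + rerun after rollback" approach 
with Ignite's public transactions API; the client config file, cache name, key, and the 
5-second timeout are made-up placeholders, not values taken from this ticket:

{code:java}
import javax.cache.CacheException;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionTimeoutException;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

public class PessimisticRetrySketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("client-config.xml");          // hypothetical client config
        IgniteCache<Long, String> cache = ignite.cache("eventCache"); // hypothetical cache name

        int maxAttempts = 3;
        long txTimeoutMs = 5_000; // much shorter than 30 s, so a stuck lock fails fast

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Transaction tx = ignite.transactions().txStart(
                    PESSIMISTIC, REPEATABLE_READ, txTimeoutMs, 1)) {
                String cur = cache.get(1L);       // acquires the lock or times out
                cache.put(1L, cur + "-updated");
                tx.commit();
                break;                            // success, stop retrying
            }
            catch (CacheException | TransactionTimeoutException e) {
                // The timed-out tx is already rolled back; just run the whole tx again.
                if (attempt == maxAttempts)
                    throw e;
            }
        }
    }
}
{code}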



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796738#comment-17796738
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We can't find the same in the server logs; still, we will look into it. As of now, no 
bulk operation is implemented.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796732#comment-17796732
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Probably you have deadlocks? Check for "Deadlock detection was timed out" in the 
attached logs.
You can search the Ignite documentation for how to avoid it.
Quick check: do you insert an unordered batch of keys? If so, you need to sort them 
(see the sketch below) or use optimistic txs.
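
As an illustration, a minimal sketch of the "sort the keys" variant; the cache name and 
client config are hypothetical, and the only point is that every transaction then 
acquires its locks in the same ascending key order:

{code:java}
import java.util.Map;
import java.util.TreeMap;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class SortedBatchPutSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("client-config.xml");          // hypothetical
        IgniteCache<Long, String> cache = ignite.cache("eventCache"); // hypothetical

        // A TreeMap iterates keys in natural order, so concurrent pessimistic
        // transactions lock the keys in the same order and cannot deadlock
        // on an unordered batch.
        Map<Long, String> batch = new TreeMap<>();
        batch.put(42L, "a");
        batch.put(7L, "b");
        batch.put(13L, "c");

        cache.putAll(batch); // keys are locked in ascending order
    }
}
{code}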



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-14 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796724#comment-17796724
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi [~zstan], today we observed the same issue in our other data center, and restarting 
the apps helped.

I am attaching the logs from all nodes of the cluster -> Ignite_server_logs.zip

 



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795899#comment-17795899
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Thank you for your response, [~zstan].

We will make the above changes and let you know how it goes; we will also provide you 
the logs from all nodes.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795883#comment-17795883
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

1. Yep, just erase or comment them out in the config.
2. OK here.
3. 30 sec is too much. If you detect a tx rollback by timeout you can rerun the tx 
(check it: an optimistic tx may be faster), but there are some differences: both tx 
writes AND reads can throw exceptions (see the optimistic retry sketch below)! 
https://ignite.apache.org/docs/latest/key-value-api/transactions
4. Can't suggest anything here; the concrete usage needs to be considered.
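
A minimal sketch of the optimistic variant mentioned in point 3, again with a 
hypothetical cache name and client config; with OPTIMISTIC/SERIALIZABLE a conflicting 
change surfaces as TransactionOptimisticException and the whole closure is simply 
retried:

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionOptimisticException;

import static org.apache.ignite.transactions.TransactionConcurrency.OPTIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.SERIALIZABLE;

public class OptimisticRetrySketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start("client-config.xml");          // hypothetical
        IgniteCache<Long, String> cache = ignite.cache("eventCache"); // hypothetical

        while (true) {
            try (Transaction tx = ignite.transactions().txStart(OPTIMISTIC, SERIALIZABLE)) {
                // No locks are held while reading; entry versions are recorded instead.
                String cur = cache.get(1L);
                cache.put(1L, cur + "-updated");

                // Locks are taken only here; a version conflict aborts instead of waiting.
                tx.commit();
                break;
            }
            catch (TransactionOptimisticException e) {
                // Another tx changed the same entry between our read and commit: retry.
            }
        }
    }
}
{code}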



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795867#comment-17795867
 ] 

Vipul Thakur commented on IGNITE-21059:
---

So, as per my understanding, I will be doing the following; please correct me if 
I am wrong:

failureDetectionTimeout and clientFailureDetectionTimeout will switch back to their 
default values, which are 10 secs and 30 secs.

We will increase the walSegmentSize from the default 64 MB to a bigger value, maybe 
around 512 MB [the limit value being 2 GB]. A config sketch of both changes is below.

Any comments regarding the txn timeout value, which is 30 secs at the client?

TcpDiscoveryVmIpFinder - the socket timeout is 60 secs at the server end and 5 secs at 
the client end.
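
For what it's worth, a sketch of those two changes using the Java configuration API 
(the real deployment uses the attached XML config, where the same properties can be 
set); the class name and exact values are just illustrations of the plan above:

{code:java}
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerConfigSketch {
    public static IgniteConfiguration config() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Back to the defaults mentioned above: 10 s for server nodes, 30 s for clients.
        cfg.setFailureDetectionTimeout(10_000L);
        cfg.setClientFailureDetectionTimeout(30_000L);

        // Raise the WAL segment size from the 64 MB default to 512 MB
        // (the setter takes bytes; the upper limit is 2 GB).
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.setWalSegmentSize(512 * 1024 * 1024);
        cfg.setDataStorageConfiguration(storageCfg);

        return cfg;
    }
}
{code}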



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795864#comment-17795864
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

If you are talking about the TcpDiscoveryVmIpFinder socket timeout - these are not 
linked things...
I suggest staying with the defaults for both failureDetectionTimeout and 
clientFailureDetectionTimeout and tuning them only if you really find that it helps, 
but all failure issues need to be investigated: if the system detects a slow client 
(no matter where the problem is - io/net/jvm pause), it seems you do not need such a 
client, and you need to fix the problem that leads to such a situation first.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795861#comment-17795861
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have a daily requirement of 90-120 million requests for reads and around 15-20 
million for writes.

Current values:

failureDetectionTimeout=12

clientFailureDetectionTimeout=12

What would be the suggested values? Should we bring these closer to what the 
socketTimeout is, like 5 secs, and should these configurations be the same at both 
the server and the client end?



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795860#comment-17795860
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

failureDetectionTimeout - too huge, in my opinion; if some node hangs, the grid will 
wait until this timeout, so problems with txs are to be expected here.
clientFailureDetectionTimeout - the same.
For rebalanceBatchSize and rebalanceThrottle I suggest the defaults.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795859#comment-17795859
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have also configured the socket timeout at the server and client end, but from the 
thread dump it seems like it is stuck at the get call in all the txns.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795858#comment-17795858
 ] 

Vipul Thakur commented on IGNITE-21059:
---

In 2.7.6 we used to observe the long-JVM-pause logger in the read services, and not 
that much in the write services.

Such behavior is not observed in 2.14. We have another such setup, with the same 
number of nodes in the cluster and the same number of clients, serving as another 
data center for our API endpoint; it has been running with no problems for over a 
month now. But when we upgraded our other data center, this issue occurred after 
just 3 days.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795855#comment-17795855
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

I suppose you don't need such a huge number of readers and writers; client nodes are 
not the bottleneck at all (but of course that is not the root cause). A long JVM pause 
on a *client* node can lead to your problem, I think.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795851#comment-17795851
 ] 

Vipul Thakur commented on IGNITE-21059:
---

We have two k8s clusters connected to that data center, where in each k8s cluster 10 
clients are read, 10 are write and 2 are a kind of admin service. So 44 client nodes 
in total. I have also updated our cluster spec: it is 5 nodes, 400 GB RAM and 1 TB SSD.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795846#comment-17795846
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Do you really need 44 client nodes? It seems that restarting the client nodes helps 
here? Is everything OK with the client nodes? No long JVM pauses?



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795838#comment-17795838
 ] 

Vipul Thakur commented on IGNITE-21059:
---

OK, please give me some time; we will change the WAL size and let you know.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795836#comment-17795836
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

We need all logs from all nodes for further analysis. Also check: 
https://ignite.apache.org/docs/latest/tools/control-script#transaction-management
and change the WAL size.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795735#comment-17795735
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Evidence that the txn timeout is enabled at the client end:

 

[2023-11-30T14:19:01,783][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641101, nodeOrder=53, dataCenterId=0], 
threadId=372, futId=9c4a6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641101, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444220, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
[2023-11-30T14:19:44,579][ERROR][grid-timeout-worker-#326%EVENT_PROCESSING%][GridDhtColocatedCache]
  Failed to acquire lock for request: GridNearLockRequest 
[topVer=AffinityTopologyVersion [topVer=93, minorTopVer=0], miniId=1, 
dhtVers=GridCacheVersion[] [null], taskNameHash=0, createTtl=-1, accessTtl=-1, 
flags=3, txLbl=null, filter=null, super=GridDistributedLockRequest 
[nodeId=62fdf256-6130-4ef3-842c-b2078f6e6c07, nearXidVer=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
threadId=897, futId=a3ba6212c81-c17f568a-3419-42a6-9042-7a1f3281301c, 
timeout=3, isInTx=true, isInvalidate=false, isRead=true, 
isolation=REPEATABLE_READ, retVals=[true], txSize=0, flags=0, keysCnt=1, 
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=312674007, 
order=1701333641190, nodeOrder=53, dataCenterId=0], committedVers=null, 
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=-885490198, 
super=GridCacheMessage [msgId=55444392, depInfo=null, 
lastAffChangedTopVer=AffinityTopologyVersion [topVer=53, minorTopVer=0], 
err=null, skipPrepare=false]
org.apache.ignite.internal.transactions.IgniteTxTimeoutCheckedException: Failed 
to acquire lock within provided timeout for transaction [timeout=3, 
tx=GridDhtTxLocal[xid=c8a166f1c81--12a3-06d7--0001, 
xidVersion=GridCacheVersion [topVer=312674007, order=1701333834380, 
nodeOrder=1, dataCenterId=0], nearXidVersion=GridCacheVersion 
[topVer=312674007, order=1701333641190, nodeOrder=53, dataCenterId=0], 
concurrency=PESSIMISTIC, isolation=REPEATABLE_READ, state=MARKED_ROLLBACK, 
invalidate=false, rollbackOnly=true, 
nodeId=f751efe5-c44c-4b3c-bcd3-dd5866ec0bdd, timeout=3, 
startTime=1701334154571, duration=30003]]


[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795714#comment-17795714
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi,

Thank you for the quick response. We have configured the tx timeout at the client end; 
our clients are written in Spring Boot and Java. Is it needed in the server's 
config.xml also?

We will also read about changing the WAL segment size and make the changes 
accordingly.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795690#comment-17795690
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Also, you have infinite transaction timeouts; please configure them as described here:
https://ignite.apache.org/docs/latest/key-value-api/transactions#deadlock-detection
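
Deadlock detection only runs for transactions that have a non-zero timeout. A sketch of how a detected deadlock then surfaces on the caller, following the pattern from the linked documentation (the cache name and timeout are illustrative):

{code:java}
// Sketch: with a non-zero timeout, a deadlocked pessimistic transaction fails with
// TransactionTimeoutException caused by TransactionDeadlockException.
import javax.cache.CacheException;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionDeadlockException;
import org.apache.ignite.transactions.TransactionIsolation;
import org.apache.ignite.transactions.TransactionTimeoutException;

public class DeadlockDetectionSketch {
    // 'ignite' is an already-started node; "myCache" is an illustrative cache name.
    static void putWithDeadlockReport(Ignite ignite, long key) {
        IgniteCache<Long, String> cache = ignite.cache("myCache");
        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                10_000, 0)) {
            cache.put(key, "value");
            tx.commit();
        } catch (CacheException e) {
            if (e.getCause() instanceof TransactionTimeoutException
                    && e.getCause().getCause() instanceof TransactionDeadlockException) {
                // The deadlock report (involved keys and nodes) is in the exception message.
                System.err.println(e.getCause().getCause().getMessage());
            }
        }
    }
}
{code}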



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795688#comment-17795688
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

Also, please increase the WAL segment size:
https://ignite.apache.org/docs/latest/persistence/native-persistence#changing-wal-segment-size
There are numerous "Starting to clean WAL archive" messages in the log.
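
For reference, the WAL segment size is raised through DataStorageConfiguration; a minimal sketch with an illustrative 128 MB value (the right size depends on the write load, this is not a recommendation taken from the ticket):

{code:java}
// Sketch: raising the WAL segment size (default is 64 MB) via DataStorageConfiguration.
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalSegmentSizeSketch {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        // Segment size is given in bytes; 128 MB here (illustrative value).
        storageCfg.setWalSegmentSize(128 * 1024 * 1024);
        // Persistence must be enabled for the WAL to be used at all.
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);

        Ignition.start(cfg);
    }
}
{code}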




[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-12 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795675#comment-17795675
 ] 

Evgeny Stanilovsky commented on IGNITE-21059:
-

[~vipul.thakur] Can you attach logs covering the observed problem (some time 
before and some time after the incident)? Thread dumps don't seem to help here. 
If the logs have already rotated, please capture a fresh copy when the incident 
repeats.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795616#comment-17795616
 ] 

Vipul Thakur commented on IGNITE-21059:
---

@cos Please help with the review.



[jira] [Commented] (IGNITE-21059) We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running cache operations

2023-12-11 Thread Vipul Thakur (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795614#comment-17795614
 ] 

Vipul Thakur commented on IGNITE-21059:
---

Hi, please review and comment, and let me know if more information is needed.
