[ https://issues.apache.org/jira/browse/IGNITE-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801194#comment-17801194 ]

Vipul Thakur commented on IGNITE-21059:
---------------------------------------

Hi [~zstan] 

 

Today we got another issue in production:


2023-12-29T03:13:47,467][INFO ][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager] *Starting to clean WAL archive [highIdx=8303528, currSize=512.0 MB, maxSize=1.0 GB]*
2023-12-29T03:13:47,468][INFO ][wal-file-cleaner%EVENT_PROCESSING-#715%EVENT_PROCESSING%][FileWriteAheadLogManager] Finish clean WAL archive [cleanCnt=1, currSize=448.0 MB, maxSize=1.0 GB]
2023-12-29T03:13:47,563][INFO ][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][FileWriteAheadLogManager] Copied file [src=/datastore2/wal/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/0000000000000005.wal, dst=/datastore2/archive/node00-eb1d0680-c0b7-41dd-a0b1-f1f5e419cbe6/0000000008303535.wal]
2023-12-29T03:14:17,080][INFO ][wal-file-archiver%EVENT_PROCESSING-#714%EVENT_PROCESSING%][Fil
 

In the above log it seems the WAL archive is also filling up fast.

Should we also set maxWalArchiveSize to a higher value than the default 1 GB?
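
For context, this is roughly how I understand the limit would be raised in the node configuration, assuming it is controlled by DataStorageConfiguration.setMaxWalArchiveSize (the 4 GB value below is only an illustrative placeholder, not a recommendation):

{code:java}
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class WalArchiveSizeSketch {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // Raise the WAL archive cap above the 1 GB default; 4 GB is just a placeholder value.
        storageCfg.setMaxWalArchiveSize(4L * 1024 * 1024 * 1024);

        // Persistence stays enabled on the default data region, as in our current cluster setup.
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);

        Ignition.start(cfg);
    }
}
{code}

In our case I assume the equivalent maxWalArchiveSize property would be set on the DataStorageConfiguration bean in the Spring XML we attached (cache-config-1.xml).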

Please find the attached logs from one of our nodes; the same can be seen on all the nodes:

[^nohup_12.out]

Please help us with your observations.

> We have upgraded our ignite instance from 2.7.6 to 2.14. Found long running 
> cache operations
> --------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-21059
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21059
>             Project: Ignite
>          Issue Type: Bug
>          Components: binary, clients
>    Affects Versions: 2.14
>            Reporter: Vipul Thakur
>            Priority: Critical
>         Attachments: Ignite_server_logs.zip, cache-config-1.xml, 
> client-service.zip, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt2, 
> digiapi-eventprocessing-app-zone1-696c8c4946-62jbx-jstck.txt3, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt1, 
> digiapi-eventprocessing-app-zone1-696c8c4946-7d57w-jstck.txt2, 
> ignite-server-nohup-1.out, ignite-server-nohup.out, image.png, long_txn_.png, 
> nohup_12.out
>
>
> We have recently upgraded from 2.7.6 to 2.14 due to an issue observed in our 
> production environment where the cluster would go into a hang state due to 
> partition map exchange.
> Please find below the ticket which I created a while back for Ignite 2.7.6:
> https://issues.apache.org/jira/browse/IGNITE-13298
> So we migrated to Apache Ignite 2.14 and the upgrade went smoothly, but on 
> the third day we could see cluster traffic dip again.
> We have 5 nodes in the cluster, with 400 GB of RAM and more than 1 TB of SSD.
> Please find the attached config [I have added it as an attachment for review].
> I have also added the server logs from the time the issue happened.
> We have set the transaction timeout as well as the socket timeout, at both 
> the server and client end, for our write operations, but it seems that 
> sometimes the cluster goes into a hang state and all our get calls get stuck; 
> slowly everything starts to freeze our JMS listener threads, and every thread 
> reaches a choked-up state after some time.
> Because of this, our read services, which do not even use transactions to 
> retrieve data, also start to choke, ultimately leading to an end-user traffic 
> dip.
> We were hoping the product upgrade would help, but that has not been the case 
> so far.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
