Hello everyone,

We successfully deployed CAS v5.2.3 to production a couple of days ago.
Our configuration is two active/passive CAS nodes with an in-memory Hazelcast cluster (running in the same JVM as CAS) that replicates the tickets.

Everything worked fine for the first two hours, but when the connections ramped up, the active node froze. We realized that the heap (2g max) was full, so we stopped both nodes and bumped Xmx up to 6g on each node. After that CAS worked perfectly.

While monitoring the heap through the day, we noticed a very steep curve going from 1g around 9am to a maximum of 5.5g around 11am. The curve then flattened and stayed around 5.5g until 8pm, after which the heap went down to around 4g.

During the 11am - 8pm period, several things happened:
- Major GC time increased up to 3s, degrading the response time of the applications that use CAS. We suspect this is related to cache eviction; the frequency was around one major GC every 30 min.
- Some users were disconnected without notice during the afternoon (or had issues getting PTs granted), obviously a consequence of the cache hitting its maximum allowed size and aggressively evicting tickets.

We suspected an eviction problem with Hazelcast, so we took a heap dump and installed Hazelcast Management Center. Our first observations were:
- We had a backup count set to 1, which doubled the size of the data held in the cluster.
- We had a huge number of PGTs: around 200000 for 3000 TGTs.
- PGTs are quite big: >10k each (according to Hazelcast Management Center).

So for the next day we disabled the Hazelcast backup. Now our heap usage is a little better. The heap starts around 1g at 9am and plateaus at 5.5g around 12. From 12 to 4pm the curve stays flat around 5.5g with only minor GCs. Around 4pm a major GC occurs every 30 min until 6pm, then the heap goes down.

Our tickets are supposed to expire after 6h. So the way I read this is: people start working around 9am and produce a lot of tickets between 9 and 12, hence the steep curve.
Between 12 and 14 the activity slows down and ticket production stops, while the tickets created around 8am start to be evicted slowly. After 14 activity picks up again and tickets are created. Around 4pm the cache is full and massively evicts the tickets created in the morning, hence the major GCs.

No users complained about being disconnected, but the heap stays close to its maximum for a good part of the day, and we still have around 200000 PGTs for 3000 TGTs. We also have around 350 threads running all day.

Our configuration is:
- Xmx 6g
- Eviction policy: default with TTL 6h
- TTK 6h for TGT (and PGT)
- LFU
- Hazelcast max heap size 70
- GC: G1GC
- Java 8
- CAS WAR overlay with Undertow
- A dozen webapps using 60+ web services, all protected by CAS

For now it works, but we have to restart the nodes every night to clean the heap. I don't like the idea of the heap being 90% full all day; if the number of connections increases we might have unwanted disconnections again. The thread count is a concern as well, and I would like to do something about these issues.

My questions:
- Are these numbers normal?
  - 200000 PGTs for 3000 TGTs?
  - 3g of PGTs?
  - 350 threads all day?
  - 90% of the heap full all day?
- Is our eviction policy correct?

I can't decide if we have a memory leak or if it's a normal situation considering our 3000 users and our 70+ applications linked by CAS. We would feel more comfortable if the heap wasn't at 90% all day.

We have several options now:
- try LRU instead of LFU
- reduce the TGT TTL to 4h
- use a different eviction policy, like a timeout on the tickets
- bump up the Xmx, hoping we would hit the sweet spot between memory consumption and cache eviction, but taking the risk of lengthy major GCs
- put the Hazelcast cluster in its own JVM
- do nothing because everything is normal...

I know it's a long text, so thank you for reading everything! Any advice will be appreciated!
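For reference, here is the back-of-the-envelope arithmetic behind the numbers above, as a small Python sketch. It is only a rough estimate under stated assumptions: I read the ">10k" reported by Management Center as "at least 10 KiB per PGT", the function names are made up for this illustration, and real heap usage will also include object overhead that this ignores.

```python
KIB = 1024
GIB = 1024 ** 3

# Figures taken from the observations above.
PGT_COUNT = 200_000
TGT_COUNT = 3_000
# Assumption: ">10k" is read as a lower bound of 10 KiB per PGT.
PGT_BYTES_LOWER_BOUND = 10 * KIB


def tickets_per_session(pgt_count, tgt_count):
    """Average number of PGTs held per TGT."""
    return pgt_count / tgt_count


def registry_bytes(ticket_count, bytes_per_ticket, backup_count=0):
    """Cluster-wide footprint: one primary copy plus backup_count backup copies."""
    return ticket_count * bytes_per_ticket * (1 + backup_count)


def implied_ticket_bytes(total_bytes, ticket_count):
    """Average ticket size implied by a total footprint."""
    return total_bytes / ticket_count


ratio = tickets_per_session(PGT_COUNT, TGT_COUNT)
no_backup = registry_bytes(PGT_COUNT, PGT_BYTES_LOWER_BOUND)
one_backup = registry_bytes(PGT_COUNT, PGT_BYTES_LOWER_BOUND, backup_count=1)
implied = implied_ticket_bytes(3 * GIB, PGT_COUNT)

print(f"PGTs per TGT:             {ratio:.1f}")
print(f"PGTs, no backup:          {no_backup / GIB:.2f} GiB (lower bound)")
print(f"PGTs, backup count 1:     {one_backup / GIB:.2f} GiB (lower bound)")
print(f"PGT size implied by 3 GiB: {implied / KIB:.1f} KiB")
```

So at 10 KiB each the PGTs alone account for roughly 2 GiB, and the 3g figure would correspond to an average PGT size of a bit under 16 KiB. With about 67 PGTs per TGT, the PGT count (rather than the TGT count) is what drives the heap.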