I understand you have no time, and I have also followed that link. My nodes are 32GB and I have allocated 8GB for heap plus some for off-heap, so I'm definitely not hitting a ceiling where the JVM would need to force some huge garbage collection.
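For what it's worth, one way to prove that out either way would be GC logging on the server nodes. A sketch only, using the standard JDK 8 HotSpot logging flags (nothing Ignite-specific; the log path is a placeholder) appended to the same JVM_OPTS hook shown further down the thread:

```shell
# Sketch: JDK 8 GC/pause logging (log path is a placeholder).
# If the jvm-pause-detector warnings line up with entries here, the pauses
# are GC; if the GC log is quiet during a pause, look instead at swap,
# disk stalls, or hypervisor steal time.
JVM_OPTS="$JVM_OPTS \
  -Xloggc:/var/log/ignite/gc-%t.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 \
  -XX:GCLogFileSize=10M"
```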
What I'm asking is: based on the config and stats I gave, do you see anything that sticks out in those configs, not the logs?

On Tue, Oct 31, 2023 at 10:42 AM Stephen Darlington <[email protected]> wrote:

> No, sorry, the issue is that I don't have the time to go through 25,000
> lines of log file. As I said, your cluster had network or long JVM pause
> issues, probably the latter:
>
> [21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
> Possible too long JVM pause: 63356 milliseconds.
>
> When nodes are continually talking to one another, no Ignite code being
> executed for over a minute is going to be a *big* problem. You need to
> tune your JVM. There are some hints in the documentation:
> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
>
> On Tue, 31 Oct 2023 at 13:16, John Smith <[email protected]> wrote:
>
>> Does any of this info help? I included what we do, more or less, plus
>> stats and configs.
>>
>> There are 9 caches, of which the biggest one is 5 million records
>> (partitioned with 1 backup); the key is a String (11 chars) and the
>> value an integer.
>>
>> The rest are replicated, and some partitioned, but max a few thousand
>> records at best.
>>
>> The nodes are 32GB; here is the output of free -m:
>>
>>               total        used        free      shared  buff/cache   available
>> Mem:          32167        2521       26760           0        2885       29222
>> Swap:          2047           0        2047
>>
>> And here are the node stats:
>>
>> Time of the snapshot: 2023-10-31 13:08:56
>>
>> +------------------------------+--------------------------------------+
>> | ID                           | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e |
>> | ID8                          | E8044C1A                             |
>> | Consistent ID                | b14350a9-6963-442c-9529-14f70f95a6d9 |
>> | Node Type                    | Server                               |
>> | Order                        | 2660                                 |
>> | Address (0)                  | xxxxxx                               |
>> | Address (1)                  | 127.0.0.1                            |
>> | Address (2)                  | 0:0:0:0:0:0:0:1%lo                   |
>> | OS info                      | Linux amd64 4.15.0-197-generic       |
>> | OS user                      | ignite                               |
>> | Deployment mode              | SHARED                               |
>> | Language runtime             | Java Platform API Spec. ver. 1.8     |
>> | Ignite version               | 2.12.0                               |
>> | Ignite instance name         | xxxxxx                               |
>> | JRE information              | HotSpot 64-Bit Tiered Compilers      |
>> | JVM start time               | 2023-09-29 14:50:39                  |
>> | Node start time              | 2023-09-29 14:54:34                  |
>> | Up time                      | 09:28:57.946                         |
>> | CPUs                         | 4                                    |
>> | Last metric update           | 2023-10-31 13:07:49                  |
>> | Non-loopback IPs             | xxxxxx, xxxxxx                       |
>> | Enabled MACs                 | xxxxxx                               |
>> | Maximum active jobs          | 1                                    |
>> | Current active jobs          | 0                                    |
>> | Average active jobs          | 0.01                                 |
>> | Maximum waiting jobs         | 0                                    |
>> | Current waiting jobs         | 0                                    |
>> | Average waiting jobs         | 0.00                                 |
>> | Maximum rejected jobs        | 0                                    |
>> | Current rejected jobs        | 0                                    |
>> | Average rejected jobs        | 0.00                                 |
>> | Maximum cancelled jobs       | 0                                    |
>> | Current cancelled jobs       | 0                                    |
>> | Average cancelled jobs       | 0.00                                 |
>> | Total rejected jobs          | 0                                    |
>> | Total executed jobs          | 2                                    |
>> | Total cancelled jobs         | 0                                    |
>> | Maximum job wait time        | 0ms                                  |
>> | Current job wait time        | 0ms                                  |
>> | Average job wait time        | 0.00ms                               |
>> | Maximum job execute time     | 11ms                                 |
>> | Current job execute time     | 0ms                                  |
>> | Average job execute time     | 5.50ms                               |
>> | Total busy time              | 5733919ms                            |
>> | Busy time %                  | 0.21%                                |
>> | Current CPU load %           | 1.93%                                |
>> | Average CPU load %           | 4.35%                                |
>> | Heap memory initialized      | 504mb                                |
>> | Heap memory used             | 310mb                                |
>> | Heap memory committed        | 556mb                                |
>> | Heap memory maximum          | 8gb                                  |
>> | Non-heap memory initialized  | 2mb                                  |
>> | Non-heap memory used         | 114mb                                |
>> | Non-heap memory committed    | 119mb                                |
>> | Non-heap memory maximum      | 0                                    |
>> | Current thread count         | 125                                  |
>> | Maximum thread count         | 140                                  |
>> | Total started thread count   | 409025                               |
>> | Current daemon thread count  | 15                                   |
>> +------------------------------+--------------------------------------+
>>
>> Data region metrics:
>>
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | Name             | Page size | Pages              | Memory       | Rates            | Checkpoint buffer | Large entries |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | Default_Region   | 0         | Total: 307665      | Total: 1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | sysMemPlc        | 0         | Total: 5           | Total: 20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | TxLog            | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | volatileDsMemPlc | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>
>> Server node config:
>>
>> if [ -z "$JVM_OPTS" ] ; then
>>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
>> fi
>>
>> #
>> # Uncomment the following GC settings if you see spikes in your
>> # throughput due to Garbage Collection.
>> #
>> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC \
>>     -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC \
>>     -XX:MaxDirectMemorySize=256m"
>>
>> And we use this as our persistence config...
>>
>> <property name="dataStorageConfiguration">
>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>         <property name="writeThrottlingEnabled" value="true"/>
>>
>>         <!-- Redefining the default region's settings -->
>>         <property name="defaultDataRegionConfiguration">
>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>                 <property name="persistenceEnabled" value="true"/>
>>                 <property name="name" value="Default_Region"/>
>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>             </bean>
>>         </property>
>>     </bean>
>> </property>
>>
>> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]> wrote:
>>
>>> There's a lot going on in that log file. It makes it difficult to tell
>>> what *the* issue is. You have lots of nodes leaving (and joining) the
>>> cluster, including server nodes. You have lost partitions and long JVM
>>> pauses. I suspect the real cause of this node shutting down was that it
>>> became segmented.
>>>
>>> Chances are the issue is either a genuine network issue or the long JVM
>>> pauses -- which means that the nodes are not talking to each other --
>>> caused the cluster to fall apart.
>>>
>>
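One related knob worth mentioning while the pauses are being diagnosed: the Ignite docs also cover raising failureDetectionTimeout on IgniteConfiguration, so a node that stalls is not dropped from the topology (and potentially segmented) before it recovers. A hedged sketch only; the 120000 ms value is an illustrative assumption, chosen because it exceeds the 63356 ms pause quoted above, not a recommendation:

```xml
<!-- Sketch: how long a node may be unresponsive before the cluster
     considers it failed (default 10000 ms). An illustrative value only;
     it should exceed the worst pause you actually observe. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureDetectionTimeout" value="120000"/>
</bean>
```

This only buys tolerance for the symptom; a cluster that routinely pauses for a minute still needs the underlying GC or OS issue fixed.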
