No, sorry, the issue is that I don't have the time to go through 25,000 lines of log file. As I said, your cluster had network or long JVM pause issues, probably the latter:
[21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 63356 milliseconds.

When nodes are continually talking to one another, no Ignite code being executed for over a minute is going to be a *big* problem. You need to tune your JVM. There are some hints in the documentation:

https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning

On Tue, 31 Oct 2023 at 13:16, John Smith <java.dev....@gmail.com> wrote:
> Does any of this info help? I included more or less what we do, plus stats
> and configs.
>
> There are 9 caches, of which the biggest holds 5 million records
> (partitioned with 1 backup); the key is a String (11 chars) and the value
> an Integer.
>
> The rest are replicated, with some partitioned, but those hold a few
> thousand records at most.
>
> The nodes have 32GB. Here is the output of free -m:
>
>                total        used        free      shared  buff/cache   available
> Mem:           32167        2521       26760           0        2885       29222
> Swap:           2047           0        2047
>
> And here are the node stats:
>
> Time of the snapshot: 2023-10-31 13:08:56
>
> +------------------------------+------------------------------------------+
> | ID                           | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e     |
> | ID8                          | E8044C1A                                 |
> | Consistent ID                | b14350a9-6963-442c-9529-14f70f95a6d9     |
> | Node Type                    | Server                                   |
> | Order                        | 2660                                     |
> | Address (0)                  | xxxxxx                                   |
> | Address (1)                  | 127.0.0.1                                |
> | Address (2)                  | 0:0:0:0:0:0:0:1%lo                       |
> | OS info                      | Linux amd64 4.15.0-197-generic           |
> | OS user                      | ignite                                   |
> | Deployment mode              | SHARED                                   |
> | Language runtime             | Java Platform API Specification ver. 1.8 |
> | Ignite version               | 2.12.0                                   |
> | Ignite instance name         | xxxxxx                                   |
> | JRE information              | HotSpot 64-Bit Tiered Compilers          |
> | JVM start time               | 2023-09-29 14:50:39                      |
> | Node start time              | 2023-09-29 14:54:34                      |
> | Up time                      | 09:28:57.946                             |
> | CPUs                         | 4                                        |
> | Last metric update           | 2023-10-31 13:07:49                      |
> | Non-loopback IPs             | xxxxxx, xxxxxx                           |
> | Enabled MACs                 | xxxxxx                                   |
> | Maximum active jobs          | 1                                        |
> | Current active jobs          | 0                                        |
> | Average active jobs          | 0.01                                     |
> | Maximum waiting jobs         | 0                                        |
> | Current waiting jobs         | 0                                        |
> | Average waiting jobs         | 0.00                                     |
> | Maximum rejected jobs        | 0                                        |
> | Current rejected jobs        | 0                                        |
> | Average rejected jobs        | 0.00                                     |
> | Maximum cancelled jobs       | 0                                        |
> | Current cancelled jobs       | 0                                        |
> | Average cancelled jobs       | 0.00                                     |
> | Total rejected jobs          | 0                                        |
> | Total executed jobs          | 2                                        |
> | Total cancelled jobs         | 0                                        |
> | Maximum job wait time        | 0ms                                      |
> | Current job wait time        | 0ms                                      |
> | Average job wait time        | 0.00ms                                   |
> | Maximum job execute time     | 11ms                                     |
> | Current job execute time     | 0ms                                      |
> | Average job execute time     | 5.50ms                                   |
> | Total busy time              | 5733919ms                                |
> | Busy time %                  | 0.21%                                    |
> | Current CPU load %           | 1.93%                                    |
> | Average CPU load %           | 4.35%                                    |
> | Heap memory initialized      | 504mb                                    |
> | Heap memory used             | 310mb                                    |
> | Heap memory committed        | 556mb                                    |
> | Heap memory maximum          | 8gb                                      |
> | Non-heap memory initialized  | 2mb                                      |
> | Non-heap memory used         | 114mb                                    |
> | Non-heap memory committed    | 119mb                                    |
> | Non-heap memory maximum      | 0                                        |
> | Current thread count         | 125                                      |
> | Maximum thread count         | 140                                      |
> | Total started thread count   | 409025                                   |
> | Current daemon thread count  | 15                                       |
> +------------------------------+------------------------------------------+
>
> Data region metrics:
>
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | Name             | Page size | Pages              | Memory        | Rates            | Checkpoint buffer | Large entries |
> +==================+===========+====================+===============+==================+===================+===============+
> | Default_Region   | 0         | Total: 307665      | Total: 1gb    | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | sysMemPlc        | 0         | Total: 5           | Total: 20kb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | TxLog            | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | volatileDsMemPlc | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>
> Server nodes config...
>
> if [ -z "$JVM_OPTS" ] ; then
>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
> fi
>
> #
> # Uncomment the following GC settings if you see spikes in your throughput
> # due to Garbage Collection.
> #
> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC \
>  -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC \
>  -XX:MaxDirectMemorySize=256m"
>
> And we use this as our persistence config...
>
> <property name="dataStorageConfiguration">
>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>         <property name="writeThrottlingEnabled" value="true"/>
>
>         <!-- Redefining the default region's settings -->
>         <property name="defaultDataRegionConfiguration">
>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                 <property name="persistenceEnabled" value="true"/>
>                 <property name="name" value="Default_Region"/>
>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>             </bean>
>         </property>
>     </bean>
> </property>
>
> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <sdarling...@apache.org> wrote:
>
>> There's a lot going on in that log file. It makes it difficult to tell
>> what *the* issue is. You have lots of nodes leaving (and joining) the
>> cluster, including server nodes. You have lost partitions and long JVM
>> pauses. I suspect the real cause of this node shutting down was that it
>> became segmented.
>>
>> Chances are that either a genuine network issue or the long JVM pauses
>> -- during which the nodes are not talking to each other -- caused the
>> cluster to fall apart.
>>
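Before changing GC flags, it is worth confirming that the 63-second stall really is a GC pause rather than an OS-level stall (swapping, disk writeback). One way is to enable GC logging in the same JVM_OPTS block; a minimal sketch, assuming the JDK 8 runtime shown in the node stats above (the log path is illustrative, not from this thread):

```shell
# Hedged sketch: JDK 8 HotSpot GC logging, so a pause reported by
# Ignite's jvm-pause-detector can be matched against GC activity.
# /var/log/ignite/gc.log is an assumed path; adjust for your install.
JVM_OPTS="${JVM_OPTS:-}"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
 -XX:+PrintGCApplicationStoppedTime \
 -Xloggc:/var/log/ignite/gc.log \
 -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"
echo "$JVM_OPTS"
```

If the GC log shows no collection or safepoint spanning the pause window, the cause is likely outside the JVM (swap activity, transparent huge pages, or an overloaded host) rather than heap tuning.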