Does any of this info help? I've included a rough description of what we do, plus stats and configs.
There are 9 caches, the biggest of which holds 5 million records
(partitioned, with 1 backup); the key is a String (11 chars) and the value
an Integer. The rest are replicated, and some partitioned, but hold at most
a few thousand records each.
The nodes have 32 GB of RAM; here is the output of free -m:

              total        used        free      shared  buff/cache   available
Mem:          32167        2521       26760           0        2885       29222
Swap:          2047           0        2047
And here are the node stats:
Time of the snapshot: 2023-10-31 13:08:56
+-----------------------------+------------------------------------------+
| ID                          | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e     |
| ID8                         | E8044C1A                                 |
| Consistent ID               | b14350a9-6963-442c-9529-14f70f95a6d9     |
| Node Type                   | Server                                   |
| Order                       | 2660                                     |
| Address (0)                 | xxxxxx                                   |
| Address (1)                 | 127.0.0.1                                |
| Address (2)                 | 0:0:0:0:0:0:0:1%lo                       |
| OS info                     | Linux amd64 4.15.0-197-generic           |
| OS user                     | ignite                                   |
| Deployment mode             | SHARED                                   |
| Language runtime            | Java Platform API Specification ver. 1.8 |
| Ignite version              | 2.12.0                                   |
| Ignite instance name        | xxxxxx                                   |
| JRE information             | HotSpot 64-Bit Tiered Compilers          |
| JVM start time              | 2023-09-29 14:50:39                      |
| Node start time             | 2023-09-29 14:54:34                      |
| Up time                     | 09:28:57.946                             |
| CPUs                        | 4                                        |
| Last metric update          | 2023-10-31 13:07:49                      |
| Non-loopback IPs            | xxxxxx, xxxxxx                           |
| Enabled MACs                | xxxxxx                                   |
| Maximum active jobs         | 1                                        |
| Current active jobs         | 0                                        |
| Average active jobs         | 0.01                                     |
| Maximum waiting jobs        | 0                                        |
| Current waiting jobs        | 0                                        |
| Average waiting jobs        | 0.00                                     |
| Maximum rejected jobs       | 0                                        |
| Current rejected jobs       | 0                                        |
| Average rejected jobs       | 0.00                                     |
| Maximum cancelled jobs      | 0                                        |
| Current cancelled jobs      | 0                                        |
| Average cancelled jobs      | 0.00                                     |
| Total rejected jobs         | 0                                        |
| Total executed jobs         | 2                                        |
| Total cancelled jobs        | 0                                        |
| Maximum job wait time       | 0ms                                      |
| Current job wait time       | 0ms                                      |
| Average job wait time       | 0.00ms                                   |
| Maximum job execute time    | 11ms                                     |
| Current job execute time    | 0ms                                      |
| Average job execute time    | 5.50ms                                   |
| Total busy time             | 5733919ms                                |
| Busy time %                 | 0.21%                                    |
| Current CPU load %          | 1.93%                                    |
| Average CPU load %          | 4.35%                                    |
| Heap memory initialized     | 504mb                                    |
| Heap memory used            | 310mb                                    |
| Heap memory committed       | 556mb                                    |
| Heap memory maximum         | 8gb                                      |
| Non-heap memory initialized | 2mb                                      |
| Non-heap memory used        | 114mb                                    |
| Non-heap memory committed   | 119mb                                    |
| Non-heap memory maximum     | 0                                        |
| Current thread count        | 125                                      |
| Maximum thread count        | 140                                      |
| Total started thread count  | 409025                                   |
| Current daemon thread count | 15                                       |
+-----------------------------+------------------------------------------+
Data region metrics:
+==================+===========+====================+===============+==================+===================+===============+
| Name             | Page size | Pages              | Memory        | Rates            | Checkpoint buffer | Large entries |
+==================+===========+====================+===============+==================+===================+===============+
| Default_Region   | 0         | Total: 307665      | Total: 1gb    | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
|                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| metastoreMemPlc  | 0         | Total: 57          | Total: 228kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
|                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| sysMemPlc        | 0         | Total: 5           | Total: 20kb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
|                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| TxLog            | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
|                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| volatileDsMemPlc | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
|                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
Server nodes config...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
fi

#
# Uncomment the following GC settings if you see spikes in your throughput
# due to Garbage Collection.
#
# JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"

JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=256m"
And we use this as our persistence config...

<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="writeThrottlingEnabled" value="true"/>
        <!-- Redefining the default region's settings -->
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
                <property name="name" value="Default_Region"/>
                <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
            </bean>
        </property>
    </bean>
</property>
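For sizing context on the 32 GB boxes, the configured ceilings add up to
roughly:

   8 GiB     heap (-Xms8g / -Xmx8g)
+ 10 GiB     Default_Region max (#{10L * 1024 * 1024 * 1024} bytes)
+  0.25 GiB  metaspace (-XX:MaxMetaspaceSize=256m)
+  0.25 GiB  direct buffers (-XX:MaxDirectMemorySize=256m)
-----------
 ~18.5 GiB

plus the checkpoint buffer (which, when not set explicitly, Ignite sizes as
a fraction of the region, on the order of a couple of GB for a region this
size, if I remember the defaults right) and thread stacks. So the maximums
should fit comfortably within 32 GB.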
On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]>
wrote:
> There's a lot going on in that log file. It makes it difficult to tell
> what *the* issue is. You have lots of nodes leaving (and joining) the
> cluster, including server nodes. You have lost partitions and long JVM
> pauses. I suspect the real cause of this node shutting down was that it
> became segmented.
>
> Chances are that either a genuine network issue or the long JVM pauses
> -- which mean the nodes are not talking to each other -- caused the
> cluster to fall apart.
>
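P.S. On the segmentation Stephen mentions above: a node that stays silent
longer than failureDetectionTimeout (10 seconds by default) gets dropped
from the topology, so long GC pauses alone can cause exactly this. Below is
a minimal sketch of raising it on the IgniteConfiguration bean; the
30-second value is purely illustrative, not something we currently run.

<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Illustrative value: tolerate pauses of up to ~30 s before the node is declared failed -->
    <property name="failureDetectionTimeout" value="30000"/>
</bean>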