Does any of this info help? I've included a summary of what we do, plus stats and configs.

There are 9 caches. The biggest one holds 5 million records (partitioned, with 1 backup); the key is a String (11 chars) and the value an Integer.

The rest are replicated, and some are partitioned, but they hold at most a few thousand records each.
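For scale, here's a back-of-envelope sizing of the biggest cache. The ~200 bytes of per-entry overhead is an assumption (entry header, index, page fill slack), not a measured figure:

```python
# Back-of-envelope sizing for the largest cache.
# PER_ENTRY_OVERHEAD is an assumed figure for Ignite's per-entry
# overhead, not a measurement.
ENTRIES = 5_000_000        # primary records in the biggest cache
COPIES = 2                 # partitioned with 1 backup -> primary + backup
KEY_BYTES = 11             # 11-char String key (ASCII)
VALUE_BYTES = 4            # Integer value
PER_ENTRY_OVERHEAD = 200   # assumption

total_bytes = ENTRIES * COPIES * (KEY_BYTES + VALUE_BYTES + PER_ENTRY_OVERHEAD)
total_gib = total_bytes / 1024**3
print(f"~{total_gib:.1f} GiB cluster-wide")   # ~2.0 GiB
```

So even with generous overhead assumptions, the data itself is small relative to the configured 10 GiB region — raw capacity is unlikely to be the problem here.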

The nodes have 32 GB of RAM; here is the output of free -m:

              total        used        free      shared  buff/cache   available
Mem:          32167        2521       26760           0        2885       29222
Swap:          2047           0        2047

And here are the node stats:

Time of the snapshot: 2023-10-31 13:08:56
+-------------------------------------------------------------------------+
| ID                          | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e      |
| ID8                         | E8044C1A                                  |
| Consistent ID               | b14350a9-6963-442c-9529-14f70f95a6d9      |
| Node Type                   | Server                                    |
| Order                       | 2660                                      |
| Address (0)                 | xxxxxx                                    |
| Address (1)                 | 127.0.0.1                                 |
| Address (2)                 | 0:0:0:0:0:0:0:1%lo                        |
| OS info                     | Linux amd64 4.15.0-197-generic            |
| OS user                     | ignite                                    |
| Deployment mode             | SHARED                                    |
| Language runtime            | Java Platform API Specification ver. 1.8  |
| Ignite version              | 2.12.0                                    |
| Ignite instance name        | xxxxxx                                    |
| JRE information             | HotSpot 64-Bit Tiered Compilers           |
| JVM start time              | 2023-09-29 14:50:39                       |
| Node start time             | 2023-09-29 14:54:34                       |
| Up time                     | 09:28:57.946                              |
| CPUs                        | 4                                         |
| Last metric update          | 2023-10-31 13:07:49                       |
| Non-loopback IPs            | xxxxxx, xxxxxx                            |
| Enabled MACs                | xxxxxx                                    |
| Maximum active jobs         | 1                                         |
| Current active jobs         | 0                                         |
| Average active jobs         | 0.01                                      |
| Maximum waiting jobs        | 0                                         |
| Current waiting jobs        | 0                                         |
| Average waiting jobs        | 0.00                                      |
| Maximum rejected jobs       | 0                                         |
| Current rejected jobs       | 0                                         |
| Average rejected jobs       | 0.00                                      |
| Maximum cancelled jobs      | 0                                         |
| Current cancelled jobs      | 0                                         |
| Average cancelled jobs      | 0.00                                      |
| Total rejected jobs         | 0                                         |
| Total executed jobs         | 2                                         |
| Total cancelled jobs        | 0                                         |
| Maximum job wait time       | 0ms                                       |
| Current job wait time       | 0ms                                       |
| Average job wait time       | 0.00ms                                    |
| Maximum job execute time    | 11ms                                      |
| Current job execute time    | 0ms                                       |
| Average job execute time    | 5.50ms                                    |
| Total busy time             | 5733919ms                                 |
| Busy time %                 | 0.21%                                     |
| Current CPU load %          | 1.93%                                     |
| Average CPU load %          | 4.35%                                     |
| Heap memory initialized     | 504mb                                     |
| Heap memory used            | 310mb                                     |
| Heap memory committed       | 556mb                                     |
| Heap memory maximum         | 8gb                                       |
| Non-heap memory initialized | 2mb                                       |
| Non-heap memory used        | 114mb                                     |
| Non-heap memory committed   | 119mb                                     |
| Non-heap memory maximum     | 0                                         |
| Current thread count        | 125                                       |
| Maximum thread count        | 140                                       |
| Total started thread count  | 409025                                    |
| Current daemon thread count | 15                                        |
+-------------------------------------------------------------------------+

Data region metrics:
+==================+===========+====================+===============+==================+===================+===============+
| Name             | Page size | Pages              | Memory        | Rates            | Checkpoint buffer | Large entries |
+==================+===========+====================+===============+==================+===================+===============+
| Default_Region   | 0         | Total:  307665     | Total:  1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
|                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| metastoreMemPlc  | 0         | Total:  57         | Total:  228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
|                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| sysMemPlc        | 0         | Total:  5          | Total:  20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
|                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| TxLog            | 0         | Total:  0          | Total:  0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
|                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
| volatileDsMemPlc | 0         | Total:  0          | Total:  0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
|                  |           | Dirty:  0          | In RAM: 0     | Eviction:   0.00 | Size:  0          |               |
|                  |           | Memory: 0          |               | Replace:    0.00 |                   |               |
|                  |           | Fill factor: 0.00% |               |                  |                   |               |
+------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+

And here is the server nodes' JVM config...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
fi

#
# Uncomment the following GC settings if you see spikes in your throughput due to Garbage Collection.
#
# JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=256m"

And we use this as our persistence config...

      <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
          <property name="writeThrottlingEnabled" value="true"/>

          <!-- Redefining the default region's settings -->
          <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
              <property name="persistenceEnabled" value="true"/>

              <property name="name" value="Default_Region"/>
              <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
            </bean>
          </property>
        </bean>
      </property>
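For what it's worth, here is how the per-node memory budget adds up under this config. The 2 GiB checkpoint buffer is my assumption of Ignite's default for data regions larger than 8 GiB; the other figures come straight from the JVM_OPTS and XML above:

```python
# Per-node memory budget under the config above.
GIB = 1024**3
MIB = 1024**2

heap = 8 * GIB              # -Xms8g -Xmx8g
data_region = 10 * GIB      # Default_Region maxSize
metaspace = 256 * MIB       # -XX:MaxMetaspaceSize=256m
direct = 256 * MIB          # -XX:MaxDirectMemorySize=256m
checkpoint_buf = 2 * GIB    # assumed Ignite default for regions > 8 GiB

budget_gib = (heap + data_region + metaspace + direct + checkpoint_buf) / GIB
print(f"~{budget_gib:.1f} GiB of 32 GiB RAM")   # ~20.5 GiB
```

That would leave roughly 11 GiB of headroom for the OS and page cache, so the box doesn't look oversubscribed on paper.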

On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]>
wrote:

> There's a lot going on in that log file. It makes it difficult to tell
> what *the* issue is. You have lots of nodes leaving (and joining) the
> cluster, including server nodes. You have lost partitions and long JVM
> pauses. I suspect the real cause of this node shutting down was that it
> became segmented.
>
> Chances are the issue is either a genuine network issue or the long JVM
> pauses -- which means that the nodes are not talking to each other --
> caused the cluster to fall apart.
>
