No, sorry, the issue is that I don't have the time to go through 25,000 lines of log file. As I said, your cluster had network or long JVM pause issues, probably the latter:
[21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 63356 milliseconds.

When nodes are continually talking to one another, no Ignite code being executed for over a minute is going to be a *big* problem. You need to tune your JVM. There are some hints in the documentation:

https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning

On Tue, 31 Oct 2023 at 13:16, John Smith <java.dev....@gmail.com> wrote:
> Does any of this info help? I included more or less what we do, plus stats
> and configs.
>
> There are 9 caches, of which the biggest holds 5 million records
> (partitioned with 1 backup); the key is a String (11 chars) and the value
> an Integer.
>
> The rest are replicated, with some partitioned, but those hold a few
> thousand records at most.
>
> The nodes have 32GB. Here is the output of free -m:
>
>                total        used        free      shared  buff/cache   available
> Mem:           32167        2521       26760           0        2885       29222
> Swap:           2047           0        2047
>
> And here are the node stats:
>
> Time of the snapshot: 2023-10-31 13:08:56
>
> +------------------------------+------------------------------------------+
> | ID                           | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e     |
> | ID8                          | E8044C1A                                 |
> | Consistent ID                | b14350a9-6963-442c-9529-14f70f95a6d9     |
> | Node Type                    | Server                                   |
> | Order                        | 2660                                     |
> | Address (0)                  | xxxxxx                                   |
> | Address (1)                  | 127.0.0.1                                |
> | Address (2)                  | 0:0:0:0:0:0:0:1%lo                       |
> | OS info                      | Linux amd64 4.15.0-197-generic           |
> | OS user                      | ignite                                   |
> | Deployment mode              | SHARED                                   |
> | Language runtime             | Java Platform API Specification ver. 1.8 |
> | Ignite version               | 2.12.0                                   |
> | Ignite instance name         | xxxxxx                                   |
> | JRE information              | HotSpot 64-Bit Tiered Compilers          |
> | JVM start time               | 2023-09-29 14:50:39                      |
> | Node start time              | 2023-09-29 14:54:34                      |
> | Up time                      | 09:28:57.946                             |
> | CPUs                         | 4                                        |
> | Last metric update           | 2023-10-31 13:07:49                      |
> | Non-loopback IPs             | xxxxxx, xxxxxx                           |
> | Enabled MACs                 | xxxxxx                                   |
> | Maximum active jobs          | 1                                        |
> | Current active jobs          | 0                                        |
> | Average active jobs          | 0.01                                     |
> | Maximum waiting jobs         | 0                                        |
> | Current waiting jobs         | 0                                        |
> | Average waiting jobs         | 0.00                                     |
> | Maximum rejected jobs        | 0                                        |
> | Current rejected jobs        | 0                                        |
> | Average rejected jobs        | 0.00                                     |
> | Maximum cancelled jobs       | 0                                        |
> | Current cancelled jobs       | 0                                        |
> | Average cancelled jobs       | 0.00                                     |
> | Total rejected jobs          | 0                                        |
> | Total executed jobs          | 2                                        |
> | Total cancelled jobs         | 0                                        |
> | Maximum job wait time        | 0ms                                      |
> | Current job wait time        | 0ms                                      |
> | Average job wait time        | 0.00ms                                   |
> | Maximum job execute time     | 11ms                                     |
> | Current job execute time     | 0ms                                      |
> | Average job execute time     | 5.50ms                                   |
> | Total busy time              | 5733919ms                                |
> | Busy time %                  | 0.21%                                    |
> | Current CPU load %           | 1.93%                                    |
> | Average CPU load %           | 4.35%                                    |
> | Heap memory initialized      | 504mb                                    |
> | Heap memory used             | 310mb                                    |
> | Heap memory committed        | 556mb                                    |
> | Heap memory maximum          | 8gb                                      |
> | Non-heap memory initialized  | 2mb                                      |
> | Non-heap memory used         | 114mb                                    |
> | Non-heap memory committed    | 119mb                                    |
> | Non-heap memory maximum      | 0                                        |
> | Current thread count         | 125                                      |
> | Maximum thread count         | 140                                      |
> | Total started thread count   | 409025                                   |
> | Current daemon thread count  | 15                                       |
> +------------------------------+------------------------------------------+
>
> Data region metrics:
>
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | Name             | Page size | Pages              | Memory        | Rates            | Checkpoint buffer | Large entries |
> +==================+===========+====================+===============+==================+===================+===============+
> | Default_Region   | 0         | Total: 307665      | Total: 1gb    | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | sysMemPlc        | 0         | Total: 5           | Total: 20kb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | TxLog            | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
> | volatileDsMemPlc | 0         | Total: 0           | Total: 0      | Allocation: 0.00 | Pages: 0          | 0.00%         |
> |                  |           | Dirty: 0           | In RAM: 0     | Eviction: 0.00   | Size: 0           |               |
> |                  |           | Memory: 0          |               | Replace: 0.00    |                   |               |
> |                  |           | Fill factor: 0.00% |               |                  |                   |               |
> +------------------+-----------+--------------------+---------------+------------------+-------------------+---------------+
>
> Server nodes config...
>
> if [ -z "$JVM_OPTS" ] ; then
>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
> fi
>
> #
> # Uncomment the following GC settings if you see spikes in your throughput
> # due to Garbage Collection.
> #
> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC \
>  -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC \
>  -XX:MaxDirectMemorySize=256m"
>
> And we use this as our persistence config...
>
> <property name="dataStorageConfiguration">
>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>         <property name="writeThrottlingEnabled" value="true"/>
>
>         <!-- Redefining the default region's settings -->
>         <property name="defaultDataRegionConfiguration">
>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>                 <property name="persistenceEnabled" value="true"/>
>                 <property name="name" value="Default_Region"/>
>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>             </bean>
>         </property>
>     </bean>
> </property>
>
> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <sdarling...@apache.org> wrote:
>
>> There's a lot going on in that log file. It makes it difficult to tell
>> what *the* issue is. You have lots of nodes leaving (and joining) the
>> cluster, including server nodes. You have lost partitions and long JVM
>> pauses. I suspect the real cause of this node shutting down was that it
>> became segmented.
>>
>> Chances are that either a genuine network issue or the long JVM pauses
>> -- during which the nodes are not talking to each other -- caused the
>> cluster to fall apart.
>>
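Before changing GC flags, it is worth confirming that the 63-second stall really is a GC pause rather than an OS-level stall (swapping, disk writeback). One way is to enable GC logging in the same JVM_OPTS block; a minimal sketch, assuming the JDK 8 runtime shown in the node stats above (the log path is illustrative, not from this thread):

```shell
# Hedged sketch: JDK 8 HotSpot GC logging, so a pause reported by
# Ignite's jvm-pause-detector can be matched against GC activity.
# /var/log/ignite/gc.log is an assumed path; adjust for your install.
JVM_OPTS="${JVM_OPTS:-}"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
 -XX:+PrintGCApplicationStoppedTime \
 -Xloggc:/var/log/ignite/gc.log \
 -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"
echo "$JVM_OPTS"
```

If the GC log shows no collection or safepoint spanning the pause window, the cause is likely outside the JVM (swap activity, transparent huge pages, or an overloaded host) rather than heap tuning.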