I understand you have no time, and I have also followed that link. My nodes are 32GB and I have allocated 8GB for heap plus some for off-heap, so I'm definitely not hitting a ceiling where the JVM would need to force some huge garbage collection.
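For what it's worth, one way to prove that out either way would be GC logging on the server nodes. A sketch only, using the standard JDK 8 HotSpot logging flags (nothing Ignite-specific; the log path is a placeholder) appended to the same JVM_OPTS hook shown further down the thread:

```shell
# Sketch: JDK 8 GC/pause logging (log path is a placeholder).
# If the jvm-pause-detector warnings line up with entries here, the pauses
# are GC; if the GC log is quiet during a pause, look instead at swap,
# disk stalls, or hypervisor steal time.
JVM_OPTS="$JVM_OPTS \
  -Xloggc:/var/log/ignite/gc-%t.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=10 \
  -XX:GCLogFileSize=10M"
```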
What I'm asking is: based on the config and stats I gave, do you see anything that sticks out in those configs, not the logs?

On Tue, Oct 31, 2023 at 10:42 AM Stephen Darlington <[email protected]> wrote:

> No, sorry, the issue is that I don't have the time to go through 25,000
> lines of log file. As I said, your cluster had network or long JVM pause
> issues, probably the latter:
>
> [21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
> Possible too long JVM pause: 63356 milliseconds.
>
> When nodes are continually talking to one another, no Ignite code being
> executed for over a minute is going to be a *big* problem. You need to
> tune your JVM. There are some hints in the documentation:
> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
>
> On Tue, 31 Oct 2023 at 13:16, John Smith <[email protected]> wrote:
>
>> Does any of this info help? I included what we do, more or less, plus
>> stats and configs.
>>
>> There are 9 caches, of which the biggest one is 5 million records
>> (partitioned with 1 backup); the key is a String (11 chars) and the
>> value an integer.
>>
>> The rest are replicated, and some partitioned, but max a few thousand
>> records at best.
>>
>> The nodes are 32GB; here is the output of free -m:
>>
>>               total        used        free      shared  buff/cache   available
>> Mem:          32167        2521       26760           0        2885       29222
>> Swap:          2047           0        2047
>>
>> And here are the node stats:
>>
>> Time of the snapshot: 2023-10-31 13:08:56
>>
>> +------------------------------+--------------------------------------+
>> | ID                           | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e |
>> | ID8                          | E8044C1A                             |
>> | Consistent ID                | b14350a9-6963-442c-9529-14f70f95a6d9 |
>> | Node Type                    | Server                               |
>> | Order                        | 2660                                 |
>> | Address (0)                  | xxxxxx                               |
>> | Address (1)                  | 127.0.0.1                            |
>> | Address (2)                  | 0:0:0:0:0:0:0:1%lo                   |
>> | OS info                      | Linux amd64 4.15.0-197-generic       |
>> | OS user                      | ignite                               |
>> | Deployment mode              | SHARED                               |
>> | Language runtime             | Java Platform API Spec. ver. 1.8     |
>> | Ignite version               | 2.12.0                               |
>> | Ignite instance name         | xxxxxx                               |
>> | JRE information              | HotSpot 64-Bit Tiered Compilers      |
>> | JVM start time               | 2023-09-29 14:50:39                  |
>> | Node start time              | 2023-09-29 14:54:34                  |
>> | Up time                      | 09:28:57.946                         |
>> | CPUs                         | 4                                    |
>> | Last metric update           | 2023-10-31 13:07:49                  |
>> | Non-loopback IPs             | xxxxxx, xxxxxx                       |
>> | Enabled MACs                 | xxxxxx                               |
>> | Maximum active jobs          | 1                                    |
>> | Current active jobs          | 0                                    |
>> | Average active jobs          | 0.01                                 |
>> | Maximum waiting jobs         | 0                                    |
>> | Current waiting jobs         | 0                                    |
>> | Average waiting jobs         | 0.00                                 |
>> | Maximum rejected jobs        | 0                                    |
>> | Current rejected jobs        | 0                                    |
>> | Average rejected jobs        | 0.00                                 |
>> | Maximum cancelled jobs       | 0                                    |
>> | Current cancelled jobs       | 0                                    |
>> | Average cancelled jobs       | 0.00                                 |
>> | Total rejected jobs          | 0                                    |
>> | Total executed jobs          | 2                                    |
>> | Total cancelled jobs         | 0                                    |
>> | Maximum job wait time        | 0ms                                  |
>> | Current job wait time        | 0ms                                  |
>> | Average job wait time        | 0.00ms                               |
>> | Maximum job execute time     | 11ms                                 |
>> | Current job execute time     | 0ms                                  |
>> | Average job execute time     | 5.50ms                               |
>> | Total busy time              | 5733919ms                            |
>> | Busy time %                  | 0.21%                                |
>> | Current CPU load %           | 1.93%                                |
>> | Average CPU load %           | 4.35%                                |
>> | Heap memory initialized      | 504mb                                |
>> | Heap memory used             | 310mb                                |
>> | Heap memory committed        | 556mb                                |
>> | Heap memory maximum          | 8gb                                  |
>> | Non-heap memory initialized  | 2mb                                  |
>> | Non-heap memory used         | 114mb                                |
>> | Non-heap memory committed    | 119mb                                |
>> | Non-heap memory maximum      | 0                                    |
>> | Current thread count         | 125                                  |
>> | Maximum thread count         | 140                                  |
>> | Total started thread count   | 409025                               |
>> | Current daemon thread count  | 15                                   |
>> +------------------------------+--------------------------------------+
>>
>> Data region metrics:
>>
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | Name             | Page size | Pages              | Memory       | Rates            | Checkpoint buffer | Large entries |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | Default_Region   | 0         | Total: 307665      | Total: 1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | sysMemPlc        | 0         | Total: 5           | Total: 20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | TxLog            | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>> | volatileDsMemPlc | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>
>> Server node config:
>>
>> if [ -z "$JVM_OPTS" ] ; then
>>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
>> fi
>>
>> #
>> # Uncomment the following GC settings if you see spikes in your
>> # throughput due to Garbage Collection.
>> #
>> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC \
>>     -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC \
>>     -XX:MaxDirectMemorySize=256m"
>>
>> And we use this as our persistence config...
>>
>> <property name="dataStorageConfiguration">
>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>         <property name="writeThrottlingEnabled" value="true"/>
>>
>>         <!-- Redefining the default region's settings -->
>>         <property name="defaultDataRegionConfiguration">
>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>                 <property name="persistenceEnabled" value="true"/>
>>                 <property name="name" value="Default_Region"/>
>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>             </bean>
>>         </property>
>>     </bean>
>> </property>
>>
>> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]> wrote:
>>
>>> There's a lot going on in that log file. It makes it difficult to tell
>>> what *the* issue is. You have lots of nodes leaving (and joining) the
>>> cluster, including server nodes. You have lost partitions and long JVM
>>> pauses. I suspect the real cause of this node shutting down was that it
>>> became segmented.
>>>
>>> Chances are the issue is either a genuine network issue or the long JVM
>>> pauses -- which means that the nodes are not talking to each other --
>>> caused the cluster to fall apart.
>>>
>>
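One related knob worth mentioning while the pauses are being diagnosed: the Ignite docs also cover raising failureDetectionTimeout on IgniteConfiguration, so a node that stalls is not dropped from the topology (and potentially segmented) before it recovers. A hedged sketch only; the 120000 ms value is an illustrative assumption, chosen because it exceeds the 63356 ms pause quoted above, not a recommendation:

```xml
<!-- Sketch: how long a node may be unresponsive before the cluster
     considers it failed (default 10000 ms). An illustrative value only;
     it should exceed the worst pause you actually observe. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="failureDetectionTimeout" value="120000"/>
</bean>
```

This only buys tolerance for the symptom; a cluster that routinely pauses for a minute still needs the underlying GC or OS issue fixed.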
