You mean in the XML config? OK, I'll check it. Thanks.
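Presumably that means a one-line change to the persistence config quoted further down -- a sketch, not yet tested:

    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- Per the suggestion above: with this set to false, Ignite
                 still throttles writes, just using a different algorithm. -->
            <property name="writeThrottlingEnabled" value="false"/>
            ...
        </bean>
    </property>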
On Wed, Nov 1, 2023 at 5:14 AM Stephen Darlington <[email protected]> wrote:

> There are lots of "throttling" warnings. It could be as simple as your
> cluster being at its limit. Faster or more disks might help, as might
> scaling out. The other possibility is that you've enabled write
> throttling. Counter-intuitively, you might want to *dis*able that. It'll
> still do write throttling, just using a different algorithm.
>
> On Tue, 31 Oct 2023 at 15:35, John Smith <[email protected]> wrote:
>
>> I understand you have no time, and I have also followed that link. My
>> nodes have 32GB, of which I have allocated 8GB for heap and some for
>> off-heap. So I'm definitely not hitting a ceiling where the JVM needs to
>> force some huge garbage collection.
>>
>> What I'm asking is: based on the config and stats I gave, do you see
>> anything that sticks out in those configs, not the logs?
>>
>> On Tue, Oct 31, 2023 at 10:42 AM Stephen Darlington <[email protected]> wrote:
>>
>>> No, sorry, the issue is that I don't have the time to go through 25,000
>>> lines of log file. As I said, your cluster had network or long JVM pause
>>> issues, probably the latter:
>>>
>>> [21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>> Possible too long JVM pause: 63356 milliseconds.
>>>
>>> When nodes are continually talking to one another, no Ignite code being
>>> executed for over a minute is going to be a *big* problem. You need to
>>> tune your JVM. There are some hints in the documentation:
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
>>>
>>> On Tue, 31 Oct 2023 at 13:16, John Smith <[email protected]> wrote:
>>>
>>>> Does any of this info help? I've included more or less what we do,
>>>> plus stats and configs.
>>>>
>>>> There are 9 caches, of which the biggest holds 5 million records
>>>> (partitioned with 1 backup); the key is a String (11 chars) and the
>>>> value an integer. The rest are replicated, and some partitioned, but
>>>> hold at most a few thousand records each.
>>>>
>>>> The nodes have 32GB; here is the output of free -m:
>>>>
>>>>               total        used        free      shared  buff/cache   available
>>>> Mem:          32167        2521       26760           0        2885       29222
>>>> Swap:          2047           0        2047
>>>>
>>>> And here are the node stats:
>>>>
>>>> Time of the snapshot: 2023-10-31 13:08:56
>>>> +-----------------------------+-------------------------------------------+
>>>> | ID                          | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e      |
>>>> | ID8                         | E8044C1A                                  |
>>>> | Consistent ID               | b14350a9-6963-442c-9529-14f70f95a6d9      |
>>>> | Node Type                   | Server                                    |
>>>> | Order                       | 2660                                      |
>>>> | Address (0)                 | xxxxxx                                    |
>>>> | Address (1)                 | 127.0.0.1                                 |
>>>> | Address (2)                 | 0:0:0:0:0:0:0:1%lo                        |
>>>> | OS info                     | Linux amd64 4.15.0-197-generic            |
>>>> | OS user                     | ignite                                    |
>>>> | Deployment mode             | SHARED                                    |
>>>> | Language runtime            | Java Platform API Specification ver. 1.8 |
>>>> | Ignite version              | 2.12.0                                    |
>>>> | Ignite instance name        | xxxxxx                                    |
>>>> | JRE information             | HotSpot 64-Bit Tiered Compilers           |
>>>> | JVM start time              | 2023-09-29 14:50:39                       |
>>>> | Node start time             | 2023-09-29 14:54:34                       |
>>>> | Up time                     | 09:28:57.946                              |
>>>> | CPUs                        | 4                                         |
>>>> | Last metric update          | 2023-10-31 13:07:49                       |
>>>> | Non-loopback IPs            | xxxxxx, xxxxxx                            |
>>>> | Enabled MACs                | xxxxxx                                    |
>>>> | Maximum active jobs         | 1                                         |
>>>> | Current active jobs         | 0                                         |
>>>> | Average active jobs         | 0.01                                      |
>>>> | Maximum waiting jobs        | 0                                         |
>>>> | Current waiting jobs        | 0                                         |
>>>> | Average waiting jobs        | 0.00                                      |
>>>> | Maximum rejected jobs       | 0                                         |
>>>> | Current rejected jobs       | 0                                         |
>>>> | Average rejected jobs       | 0.00                                      |
>>>> | Maximum cancelled jobs      | 0                                         |
>>>> | Current cancelled jobs      | 0                                         |
>>>> | Average cancelled jobs      | 0.00                                      |
>>>> | Total rejected jobs         | 0                                         |
>>>> | Total executed jobs         | 2                                         |
>>>> | Total cancelled jobs        | 0                                         |
>>>> | Maximum job wait time       | 0ms                                       |
>>>> | Current job wait time       | 0ms                                       |
>>>> | Average job wait time       | 0.00ms                                    |
>>>> | Maximum job execute time    | 11ms                                      |
>>>> | Current job execute time    | 0ms                                       |
>>>> | Average job execute time    | 5.50ms                                    |
>>>> | Total busy time             | 5733919ms                                 |
>>>> | Busy time %                 | 0.21%                                     |
>>>> | Current CPU load %          | 1.93%                                     |
>>>> | Average CPU load %          | 4.35%                                     |
>>>> | Heap memory initialized     | 504mb                                     |
>>>> | Heap memory used            | 310mb                                     |
>>>> | Heap memory committed       | 556mb                                     |
>>>> | Heap memory maximum         | 8gb                                       |
>>>> | Non-heap memory initialized | 2mb                                       |
>>>> | Non-heap memory used        | 114mb                                     |
>>>> | Non-heap memory committed   | 119mb                                     |
>>>> | Non-heap memory maximum     | 0                                         |
>>>> | Current thread count        | 125                                       |
>>>> | Maximum thread count        | 140                                       |
>>>> | Total started thread count  | 409025                                    |
>>>> | Current daemon thread count | 15                                        |
>>>> +-----------------------------+-------------------------------------------+
>>>>
>>>> Data region metrics:
>>>>
>>>> +==================+===========+====================+==============+==================+===================+===============+
>>>> | Name             | Page size | Pages              | Memory       | Rates            | Checkpoint buffer | Large entries |
>>>> +==================+===========+====================+==============+==================+===================+===============+
>>>> | Default_Region   | 0         | Total: 307665      | Total: 1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | sysMemPlc        | 0         | Total: 5           | Total: 20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | TxLog            | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | volatileDsMemPlc | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
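A side note on the table above: the zeroed Dirty/Memory/fill-factor counters usually just mean that data-region metrics aren't being collected. If I'm reading the docs right, they can be switched on per region -- a sketch against our Default_Region config below, untested:

    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="name" value="Default_Region"/>
        <!-- enables the per-region counters shown in the stats above -->
        <property name="metricsEnabled" value="true"/>
        ...
    </bean>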
>>>> Server nodes config...
>>>>
>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
>>>> fi
>>>>
>>>> #
>>>> # Uncomment the following GC settings if you see spikes in your
>>>> # throughput due to Garbage Collection.
>>>> #
>>>> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>>>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=256m"
>>>>
>>>> And we use this as our persistence config...
>>>>
>>>> <property name="dataStorageConfiguration">
>>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>>         <property name="writeThrottlingEnabled" value="true"/>
>>>>
>>>>         <!-- Redefining the default region's settings -->
>>>>         <property name="defaultDataRegionConfiguration">
>>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>>                 <property name="persistenceEnabled" value="true"/>
>>>>                 <property name="name" value="Default_Region"/>
>>>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>>             </bean>
>>>>         </property>
>>>>     </bean>
>>>> </property>
>>>>
>>>> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]> wrote:
>>>>
>>>>> There's a lot going on in that log file. It makes it difficult to tell
>>>>> what *the* issue is. You have lots of nodes leaving (and joining) the
>>>>> cluster, including server nodes. You have lost partitions and long JVM
>>>>> pauses. I suspect the real cause of this node shutting down was that it
>>>>> became segmented.
>>>>>
>>>>> Chances are that either a genuine network issue or the long JVM pauses
>>>>> -- during which the nodes are not talking to each other -- caused the
>>>>> cluster to fall apart.
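Following up on the JVM-tuning advice above: before touching any GC settings, I'd want to confirm the 63-second pause really is garbage collection -- with only ~310mb of an 8gb heap in use, it could just as easily be swapping or an I/O stall. A sketch of the Java 8 GC logging flags that could be appended to the same JVM_OPTS; the log path is just an example:

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log"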
