You mean in the XML config? OK, I'll check it. Thanks.
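Presumably that means a one-line change to the persistence config quoted further down -- a sketch, not yet tested:

    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- Per the suggestion above: with this set to false, Ignite
                 still throttles writes, just using a different algorithm. -->
            <property name="writeThrottlingEnabled" value="false"/>
            ...
        </bean>
    </property>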
On Wed, Nov 1, 2023 at 5:14 AM Stephen Darlington <[email protected]> wrote:

> There are lots of "throttling" warnings. It could be as simple as your
> cluster being at its limit. Faster or more disks might help, as might
> scaling out. The other possibility is that you've enabled write
> throttling. Counter-intuitively, you might want to *dis*able that. It'll
> still do write throttling, just using a different algorithm.
>
> On Tue, 31 Oct 2023 at 15:35, John Smith <[email protected]> wrote:
>
>> I understand you have no time, and I have also followed that link. My
>> nodes have 32GB, of which I have allocated 8GB for heap and some for
>> off-heap. So I'm definitely not hitting a ceiling where the JVM needs to
>> force some huge garbage collection.
>>
>> What I'm asking is: based on the config and stats I gave, do you see
>> anything that sticks out in those configs, not the logs?
>>
>> On Tue, Oct 31, 2023 at 10:42 AM Stephen Darlington <[email protected]> wrote:
>>
>>> No, sorry, the issue is that I don't have the time to go through 25,000
>>> lines of log file. As I said, your cluster had network or long JVM pause
>>> issues, probably the latter:
>>>
>>> [21:37:12,517][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx]
>>> Possible too long JVM pause: 63356 milliseconds.
>>>
>>> When nodes are continually talking to one another, no Ignite code being
>>> executed for over a minute is going to be a *big* problem. You need to
>>> tune your JVM. There are some hints in the documentation:
>>> https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning
>>>
>>> On Tue, 31 Oct 2023 at 13:16, John Smith <[email protected]> wrote:
>>>
>>>> Does any of this info help? I've included more or less what we do,
>>>> plus stats and configs.
>>>>
>>>> There are 9 caches, of which the biggest holds 5 million records
>>>> (partitioned with 1 backup); the key is a String (11 chars) and the
>>>> value an integer. The rest are replicated, and some partitioned, but
>>>> hold at most a few thousand records each.
>>>>
>>>> The nodes have 32GB; here is the output of free -m:
>>>>
>>>>               total        used        free      shared  buff/cache   available
>>>> Mem:          32167        2521       26760           0        2885       29222
>>>> Swap:          2047           0        2047
>>>>
>>>> And here are the node stats:
>>>>
>>>> Time of the snapshot: 2023-10-31 13:08:56
>>>> +-----------------------------+-------------------------------------------+
>>>> | ID                          | e8044c1a-6e0d-4f94-9a04-0711a3d7fc6e      |
>>>> | ID8                         | E8044C1A                                  |
>>>> | Consistent ID               | b14350a9-6963-442c-9529-14f70f95a6d9      |
>>>> | Node Type                   | Server                                    |
>>>> | Order                       | 2660                                      |
>>>> | Address (0)                 | xxxxxx                                    |
>>>> | Address (1)                 | 127.0.0.1                                 |
>>>> | Address (2)                 | 0:0:0:0:0:0:0:1%lo                        |
>>>> | OS info                     | Linux amd64 4.15.0-197-generic            |
>>>> | OS user                     | ignite                                    |
>>>> | Deployment mode             | SHARED                                    |
>>>> | Language runtime            | Java Platform API Specification ver. 1.8 |
>>>> | Ignite version              | 2.12.0                                    |
>>>> | Ignite instance name        | xxxxxx                                    |
>>>> | JRE information             | HotSpot 64-Bit Tiered Compilers           |
>>>> | JVM start time              | 2023-09-29 14:50:39                       |
>>>> | Node start time             | 2023-09-29 14:54:34                       |
>>>> | Up time                     | 09:28:57.946                              |
>>>> | CPUs                        | 4                                         |
>>>> | Last metric update          | 2023-10-31 13:07:49                       |
>>>> | Non-loopback IPs            | xxxxxx, xxxxxx                            |
>>>> | Enabled MACs                | xxxxxx                                    |
>>>> | Maximum active jobs         | 1                                         |
>>>> | Current active jobs         | 0                                         |
>>>> | Average active jobs         | 0.01                                      |
>>>> | Maximum waiting jobs        | 0                                         |
>>>> | Current waiting jobs        | 0                                         |
>>>> | Average waiting jobs        | 0.00                                      |
>>>> | Maximum rejected jobs       | 0                                         |
>>>> | Current rejected jobs       | 0                                         |
>>>> | Average rejected jobs       | 0.00                                      |
>>>> | Maximum cancelled jobs      | 0                                         |
>>>> | Current cancelled jobs      | 0                                         |
>>>> | Average cancelled jobs      | 0.00                                      |
>>>> | Total rejected jobs         | 0                                         |
>>>> | Total executed jobs         | 2                                         |
>>>> | Total cancelled jobs        | 0                                         |
>>>> | Maximum job wait time       | 0ms                                       |
>>>> | Current job wait time       | 0ms                                       |
>>>> | Average job wait time       | 0.00ms                                    |
>>>> | Maximum job execute time    | 11ms                                      |
>>>> | Current job execute time    | 0ms                                       |
>>>> | Average job execute time    | 5.50ms                                    |
>>>> | Total busy time             | 5733919ms                                 |
>>>> | Busy time %                 | 0.21%                                     |
>>>> | Current CPU load %          | 1.93%                                     |
>>>> | Average CPU load %          | 4.35%                                     |
>>>> | Heap memory initialized     | 504mb                                     |
>>>> | Heap memory used            | 310mb                                     |
>>>> | Heap memory committed       | 556mb                                     |
>>>> | Heap memory maximum         | 8gb                                       |
>>>> | Non-heap memory initialized | 2mb                                       |
>>>> | Non-heap memory used        | 114mb                                     |
>>>> | Non-heap memory committed   | 119mb                                     |
>>>> | Non-heap memory maximum     | 0                                         |
>>>> | Current thread count        | 125                                       |
>>>> | Maximum thread count        | 140                                       |
>>>> | Total started thread count  | 409025                                    |
>>>> | Current daemon thread count | 15                                        |
>>>> +-----------------------------+-------------------------------------------+
>>>>
>>>> Data region metrics:
>>>>
>>>> +==================+===========+====================+==============+==================+===================+===============+
>>>> | Name             | Page size | Pages              | Memory       | Rates            | Checkpoint buffer | Large entries |
>>>> +==================+===========+====================+==============+==================+===================+===============+
>>>> | Default_Region   | 0         | Total: 307665      | Total: 1gb   | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | metastoreMemPlc  | 0         | Total: 57          | Total: 228kb | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | sysMemPlc        | 0         | Total: 5           | Total: 20kb  | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | TxLog            | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
>>>> | volatileDsMemPlc | 0         | Total: 0           | Total: 0     | Allocation: 0.00 | Pages: 0          | 0.00%         |
>>>> |                  |           | Dirty: 0           | In RAM: 0    | Eviction: 0.00   | Size: 0           |               |
>>>> |                  |           | Memory: 0          |              | Replace: 0.00    |                   |               |
>>>> |                  |           | Fill factor: 0.00% |              |                  |                   |               |
>>>> +------------------+-----------+--------------------+--------------+------------------+-------------------+---------------+
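A side note on the table above: the zeroed Dirty/Memory/fill-factor counters usually just mean that data-region metrics aren't being collected. If I'm reading the docs right, they can be switched on per region -- a sketch against our Default_Region config below, untested:

    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="name" value="Default_Region"/>
        <!-- enables the per-region counters shown in the stats above -->
        <property name="metricsEnabled" value="true"/>
        ...
    </bean>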
>>>> Server nodes config...
>>>>
>>>> if [ -z "$JVM_OPTS" ] ; then
>>>>     JVM_OPTS="-Xms8g -Xmx8g -server -XX:MaxMetaspaceSize=256m"
>>>> fi
>>>>
>>>> #
>>>> # Uncomment the following GC settings if you see spikes in your
>>>> # throughput due to Garbage Collection.
>>>> #
>>>> # JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>>>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:MaxDirectMemorySize=256m"
>>>>
>>>> And we use this as our persistence config...
>>>>
>>>> <property name="dataStorageConfiguration">
>>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>>         <property name="writeThrottlingEnabled" value="true"/>
>>>>
>>>>         <!-- Redefining the default region's settings -->
>>>>         <property name="defaultDataRegionConfiguration">
>>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>>                 <property name="persistenceEnabled" value="true"/>
>>>>                 <property name="name" value="Default_Region"/>
>>>>                 <property name="maxSize" value="#{10L * 1024 * 1024 * 1024}"/>
>>>>             </bean>
>>>>         </property>
>>>>     </bean>
>>>> </property>
>>>>
>>>> On Tue, Oct 31, 2023 at 5:27 AM Stephen Darlington <[email protected]> wrote:
>>>>
>>>>> There's a lot going on in that log file. It makes it difficult to tell
>>>>> what *the* issue is. You have lots of nodes leaving (and joining) the
>>>>> cluster, including server nodes. You have lost partitions and long JVM
>>>>> pauses. I suspect the real cause of this node shutting down was that it
>>>>> became segmented.
>>>>>
>>>>> Chances are that either a genuine network issue or the long JVM pauses
>>>>> -- during which the nodes are not talking to each other -- caused the
>>>>> cluster to fall apart.
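Following up on the JVM-tuning advice above: before touching any GC settings, I'd want to confirm the 63-second pause really is garbage collection -- with only ~310mb of an 8gb heap in use, it could just as easily be swapping or an I/O stall. A sketch of the Java 8 GC logging flags that could be appended to the same JVM_OPTS; the log path is just an example:

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log"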
