Hi,

We have a 4-node cluster with 64G of RAM per node. Each node has a 40G disk
attached for /work and another 40G disk for /walarchive. The WAL dir is 10G
per node.

Our data region configuration and JVM_OPTS are below. We are getting
timeouts on checkpointing, the WAL archive keeps filling up, and no data is
moving to the /work directory. The checkpointing logs are included at the
end. Could you suggest what is wrong with these settings?

Note: with maxWalArchiveSize at 16G and checkpointPageBufferSize at 4G, data
was being written, but very slowly, and the throttling rate was 35%. I
increased both values, and after that the timeout errors started to appear.
JVM_OPTS: -XX:MaxDirectMemorySize=2g -Xms20g -Xmx25g
-XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC
-XX:+DisableExplicitGC
Ignite config:
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="walBufferSize" value="#{256L * 1024 * 1024}"/>
        <property name="checkpointFrequency" value="30000"/>
        <property name="checkpointThreads" value="12"/>
        <property name="walSegmentSize" value="#{512L * 1024 * 1024}"/>
        <property name="maxWalArchiveSize" value="#{32L * 1024 * 1024 * 1024}"/>
        <property name="writeThrottlingEnabled" value="true"/>
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="persistenceEnabled" value="true"/>
                <!-- https://ignite.apache.org/docs/latest/persistence/persistence-tuning#adjusting-checkpointing-buffer-size -->
                <property name="checkpointPageBufferSize" value="#{8L * 1024 * 1024 * 1024}"/>
                <!--<property name="pageReplacementMode" value="SEGMENTED_LRU"/>-->
            </bean>
        </property>
        <property name="walPath" value="/ignite/wal"/>
        <property name="walArchivePath" value="/ignite/walarchive"/>
    </bean>
</property>
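To make the sizes above easier to compare, here is a rough sizing sketch in Python. It only does arithmetic on the numbers quoted in this post. The one value not taken from the post is the number of active WAL segments: walSegments is not set in the config, so the sketch assumes 10, which I believe is the Ignite default. The data region maxSize is also not set here, so the "RAM left" figure is simply what remains after the heap, direct memory and checkpoint buffer limits, not a statement about what Ignite actually allocates.

```python
# Back-of-the-envelope sizing check for one node, using the numbers from the post.
GIB = 1024 ** 3
MIB = 1024 ** 2

# Values quoted in the post.
ram_per_node = 64 * GIB
heap_max = 25 * GIB                # -Xmx25g
direct_mem_max = 2 * GIB           # -XX:MaxDirectMemorySize=2g
checkpoint_page_buffer = 8 * GIB   # checkpointPageBufferSize
wal_archive_disk = 40 * GIB        # disk attached for /walarchive
max_wal_archive = 32 * GIB         # maxWalArchiveSize
wal_dir = 10 * GIB                 # WAL dir size per node
wal_segment = 512 * MIB            # walSegmentSize

# ASSUMPTION: walSegments is not set in the post; 10 is assumed as the default.
assumed_wal_segments = 10


def sizing_report():
    """Return a dict of derived sizes for a single node."""
    committed = heap_max + direct_mem_max + checkpoint_page_buffer
    return {
        # RAM not covered by heap, direct memory and checkpoint buffer limits.
        "ram_left_gib": (ram_per_node - committed) / GIB,
        # Fraction of the /walarchive disk the archive is allowed to fill.
        "archive_share_of_disk": max_wal_archive / wal_archive_disk,
        "archive_headroom_gib": (wal_archive_disk - max_wal_archive) / GIB,
        # Space taken by the active WAL under the assumed segment count.
        "active_wal_gib": assumed_wal_segments * wal_segment / GIB,
        "segments_fitting_in_wal_dir": wal_dir // wal_segment,
        "segments_in_archive_cap": max_wal_archive // wal_segment,
    }


if __name__ == "__main__":
    for name, value in sizing_report().items():
        print(f"{name}: {value}")
```

With these inputs the archive cap is 80% of the /walarchive disk (8G of headroom), and the heap, direct memory and checkpoint buffer limits together account for 35G of the 64G of RAM.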
Checkpointing logs:
[14:56:55,209][INFO][db-checkpoint-thread-#105][Checkpointer] Checkpoint
started [checkpointId=4b071279-8f25-4cd0-a5ef-e6f58f3a5653,
startPtr=WALPointer [idx=930, fileOff=527943956, len=51411],
checkpointBeforeLockTime=9ms, checkpointLockWait=0ms,
checkpointListenersExecuteTime=6ms, checkpointLockHoldTime=9ms,
walCpRecordFsyncDuration=7ms, writeCheckpointEntryDuration=3ms,
splitAndSortCpPagesDuration=45ms, pages=35542, reason='timeout']
[14:56:55,921][INFO][db-checkpoint-thread-#105][Checkpointer] Checkpoint
finished [cpId=4b071279-8f25-4cd0-a5ef-e6f58f3a5653, pages=35542,
markPos=WALPointer [idx=930, fileOff=527943956, len=51411],
walSegmentsCovered=[], markDuration=64ms, pagesWrite=108ms, fsync=604ms,
total=785ms]
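
For anyone who wants to compare this checkpoint against others, here is a small Python sketch that pulls the page count and durations out of a "Checkpoint finished" entry and estimates the write rate. The log text is the entry above joined onto one line. The page size is an assumption: pageSize is not set in the config, so the sketch uses 4096 bytes, which I believe is the Ignite default. The function name and field names are just for illustration.

```python
import re

# The "Checkpoint finished" entry from the post, joined onto a single line.
FINISHED = (
    "[14:56:55,921][INFO][db-checkpoint-thread-#105][Checkpointer] Checkpoint "
    "finished [cpId=4b071279-8f25-4cd0-a5ef-e6f58f3a5653, pages=35542, "
    "markPos=WALPointer [idx=930, fileOff=527943956, len=51411], "
    "walSegmentsCovered=[], markDuration=64ms, pagesWrite=108ms, fsync=604ms, "
    "total=785ms]"
)

# ASSUMPTION: pageSize is not set in the posted config; 4096 bytes is assumed.
ASSUMED_PAGE_SIZE = 4096


def checkpoint_stats(entry, page_size=ASSUMED_PAGE_SIZE):
    """Extract page count and millisecond durations from one log entry."""
    # Every "name=<digits>ms" field, e.g. fsync=604ms -> {"fsync": 604}.
    durations = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)ms", entry)}
    pages = int(re.search(r"\bpages=(\d+)", entry).group(1))
    mib_written = pages * page_size / 1024 ** 2
    return {
        "pages": pages,
        "durations_ms": durations,
        "mib_written": mib_written,
        "mib_per_sec": mib_written / (durations["total"] / 1000),
        "fsync_share": durations["fsync"] / durations["total"],
    }


if __name__ == "__main__":
    stats = checkpoint_stats(FINISHED)
    print(stats["pages"], "pages,", round(stats["mib_written"], 1), "MiB")
    print(round(stats["mib_per_sec"], 1), "MiB/s over the whole checkpoint")
    print(round(stats["fsync_share"] * 100), "% of total time spent in fsync")
```

Under the 4096-byte assumption, this one checkpoint wrote about 139 MiB in 785ms, with roughly 77% of that time spent in fsync. It is a single sample, so it says nothing about the checkpoints that timed out.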