Hello!

Is it possible that you have deleted or lost some of the WAL files from this instance?
If not, I'm afraid we can only figure it out if you share the PDS files (wal + checkpoint dirs) of the affected instance.

Regards,
--
Ilya Kasnacheev


Mon, 4 Feb 2019 at 10:07, radha jai <jairadhah...@gmail.com>:

> I am using the default WAL mode. I think it's LOG_ONLY.
> The log of the crashed Ignite server is below:
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Read checkpoint status [startMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin, endMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin]"}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Checking memory state [lastValidPos=FileWALPointer [idx=412, fileOff=50500521, len=57801], lastMarked=FileWALPointer [idx=426, fileOff=38038736, len=57801], lastCheckpointId=63969238-f350-4b12-bdf5-f7a540021e58]"}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"WARN","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,094","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start."}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,105","logger":"","timezone":"UTC","marker":"","log":"Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]]] class org.apache.ignite.internal.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:2120)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:1929)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:755)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:789)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:674)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:748)
> "}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,106","logger":"","timezone":"UTC","marker":"","log":"JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]]]"}
>
> Regards
> Krupa
>
> On Fri, 1 Feb 2019 at 19:50, Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>
>> Hello!
>>
>> It's hard to say outright. Can you provide the full log from before the
>> node crash? Is there a chance that you ran out of disk space? What's
>> your WALMode?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Fri, 1 Feb 2019 at 08:16, radha jai <jairadhah...@gmail.com>:
>>
>>> Hi,
>>> Ignite is deployed on k8s with 12 ignite-servers, spread out one per
>>> worker node. The limits are 1 CPU and 32GB RAM, with a maximum of 8 CPU
>>> and 64GB. Each ignite-server has a WAL and persistent storage volume
>>> of 30GB.
>>> After inserting 60GB of data into the Ignite cluster, one of the nodes
>>> crashes and never recovers. The error on startup indicates that the
>>> WAL fails to restore memory state:
>>>
>>> type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL)
>>>
>>> The following warning message is seen in some of the server logs:
>>>
>>> [03:53:53,375][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 1022 milliseconds.
>>>
>>> A snippet of the Ignite configuration is below:
>>>
>>> <property name="peerClassLoadingEnabled" value="true"/>
>>>
>>> <property name="dataStorageConfiguration">
>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>         <!-- Enable metrics for Ignite persistence -->
>>>         <property name="metricsEnabled" value="true"/>
>>>
>>>         <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>                 <property name="name" value="Default_Region"/>
>>>                 <property name="initialSize" value="#{32L * 1024 * 1024 * 1024}"/>
>>>                 <property name="maxSize" value="#{64L * 1024 * 1024 * 1024}"/>
>>>
>>>                 <!-- Enabling Apache Ignite Persistent Store. -->
>>>                 <property name="persistenceEnabled" value="true"/>
>>>
>>>                 <!-- Enable metrics for this data region -->
>>>                 <property name="metricsEnabled" value="true"/>
>>>             </bean>
>>>         </property>
>>>
>>>         <property name="storagePath" value="/opt/ignite/persistence/"/>
>>>         <property name="walPath" value="/opt/ignite/wal/"/>
>>>     </bean>
>>> </property>
>>>
>>> Ignite JVM configuration: -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
>>>
>>> Thanks
>>> radha
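
For anyone inspecting a similar node: the startup log in this thread names a START marker for checkpoint 63969238-f350-4b12-bdf5-f7a540021e58 and an END marker for a different, earlier checkpoint, which is consistent with the "stopped in the middle of checkpoint" warning. The following is only a minimal Python sketch for spotting that situation in a listing of the cp directory. It assumes marker files are named `<timestamp>-<checkpoint id>-START.bin` and `<timestamp>-<checkpoint id>-END.bin`, a pattern inferred solely from the two paths in the log; the helper name `unfinished_checkpoints` is hypothetical and is not part of Ignite.

```python
import re
from collections import defaultdict

# Assumed file-name pattern, inferred only from the two marker paths in the log:
#   <timestamp-ms>-<checkpoint-uuid>-START.bin  /  <timestamp-ms>-<checkpoint-uuid>-END.bin
MARKER_RE = re.compile(
    r"^(?P<ts>\d+)-"
    r"(?P<id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-"
    r"(?P<kind>START|END)\.bin$"
)


def unfinished_checkpoints(file_names):
    """Return sorted (timestamp, checkpoint_id) pairs that have a START marker but no END marker."""
    kinds = defaultdict(set)
    for name in file_names:
        match = MARKER_RE.match(name)
        if match is None:
            continue  # not a checkpoint marker under the assumed pattern; skip it
        key = (int(match.group("ts")), match.group("id"))
        kinds[key].add(match.group("kind"))
    return sorted(key for key, seen in kinds.items() if seen == {"START"})


# The two marker file names quoted in the startup log above.
names = [
    "1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin",
    "1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin",
]
print(unfinished_checkpoints(names))
```

On a real node the list could come from `os.listdir()` on the cp directory under the storage path. The sketch only reports which checkpoint was left open; it does not check whether the matching checkpoint record is still present in the WAL, which is what the exception is actually complaining about.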