Re: checkpoint marker is present on disk, but checkpoint record is missed in WAL

Ilya Kasnacheev Sun, 03 Feb 2019 23:34:49 -0800

Hello!

Is it possible that you have deleted or lost some of the WAL files from this
instance?

If not, I'm afraid we can only figure it out if you share the PDS files
(wal + checkpoint dirs) of the affected instance.
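
One quick way to check for lost WAL files is to look for a gap in the archived
segment indexes between the two pointers in the error (idx=412 and idx=426).
The following is only a minimal sketch under assumptions: it assumes archived
segments are named as a 16-digit zero-padded index followed by ".wal", and the
directory passed in is a hypothetical path that has to be replaced with the
node's actual WAL archive folder. The function names are made up for
illustration and are not part of Ignite.

```python
import re
from pathlib import Path

# Assumption: archived WAL segments are named "<16-digit zero-padded index>.wal".
SEGMENT_NAME = re.compile(r"^(\d{16})\.wal$")


def archived_segment_indexes(wal_archive_dir):
    """Return the sorted segment indexes found in a WAL archive directory."""
    indexes = []
    for entry in Path(wal_archive_dir).iterdir():
        match = SEGMENT_NAME.match(entry.name)
        if entry.is_file() and match:
            indexes.append(int(match.group(1)))
    return sorted(indexes)


def missing_segments(indexes, first, last):
    """Return the indexes in [first, last] that are absent from `indexes`."""
    present = set(indexes)
    return [i for i in range(first, last + 1) if i not in present]
```

For the pointers above, the call would be
missing_segments(archived_segment_indexes(archive_dir), 412, 426), where
archive_dir is a placeholder. The newest segments may still sit in the WAL work
directory rather than in the archive, so indexes missing only at the tail of
the range are not by themselves proof of loss.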

Regards,
-- 
Ilya Kasnacheev


Mon, 4 Feb 2019 at 10:07, radha jai <jairadhah...@gmail.com>:

> I am using the default WAL mode. I think it's LOG_ONLY.
> The log from the crashed Ignite server is below:
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
> 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Read
> checkpoint status
> [startMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin,
> endMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin]"}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
> 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Checking
> memory state [lastValidPos=FileWALPointer [idx=412, fileOff=50500521,
> len=57801], lastMarked=FileWALPointer [idx=426, fileOff=38038736,
> len=57801], lastCheckpointId=63969238-f350-4b12-bdf5-f7a540021e58]"}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"WARN","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
> 16:01:29,094","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Ignite
> node stopped in the middle of checkpoint. Will restore memory state and
> finish checkpoint on node start."}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
> 16:01:29,105","logger":"","timezone":"UTC","marker":"","log":"Critical
> system error detected. Will be handled accordingly to configured handler
> [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
> o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state
> (checkpoint marker is present on disk, but checkpoint record is missed in
> WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
> cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
> [idx=426, fileOff=38038736, len=57801],
> cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
> [idx=412, fileOff=50500521, len=57801]], lastRead=null]]] class
> org.apache.ignite.internal.pagemem.wal.StorageException: Failed to restore
> memory state (checkpoint marker is present on disk, but checkpoint record
> is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
> cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
> [idx=426, fileOff=38038736, len=57801],
> cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
> [idx=412, fileOff=50500521, len=57801]], lastRead=null]
>
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:2120)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:1929)
>         at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:755)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:789)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:674)
>         at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
>         at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
>         at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>         at java.lang.Thread.run(Thread.java:748)
>
> "}
>
> {"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31
> 16:01:29,106","logger":"","timezone":"UTC","marker":"","log":"JVM will be
> halted immediately due to the failure: [failureCtx=FailureContext 
> [type=CRITICAL_ERROR,
> err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory
> state (checkpoint marker is present on disk, but checkpoint record is
> missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044,
> cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer
> [idx=426, fileOff=38038736, len=57801],
> cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer
> [idx=412, fileOff=50500521, len=57801]], lastRead=null]]]"}
>
>
>
> Regards
>
> Krupa
>
> On Fri, 1 Feb 2019 at 19:50, Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> wrote:
>
>> Hello!
>>
>> It's hard to say outright. Can you provide the full log from before the
>> node crash? Is there a chance that you ran out of disk space? What's your
>> WALMode?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Fri, 1 Feb 2019 at 08:16, radha jai <jairadhah...@gmail.com>:
>>
>>> Hi,
>>>    Ignite has been deployed on k8s with 12 ignite-servers, spread out
>>> one per worker node.  The limits are 1 CPU and 32GB RAM, with a maximum
>>> of 8 CPUs and 64GB.  Each ignite-server has a WAL and persistent storage
>>> volume of 30GB.
>>>    After inserting 60GB of data into the Ignite cluster, one of the nodes
>>> crashes and never recovers.  The error on startup indicates that the node
>>> fails to restore memory state from the WAL:
>>>    type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
>>> Failed to restore memory state (checkpoint marker is present on disk, but
>>> checkpoint record is missed in WAL)
>>>
>>> The following warning message is seen in some of the server logs:
>>>
>>> [03:53:53,375][WARNING][jvm-pause-detector-worker][] Possible too long
>>> JVM pause: 1022 milliseconds.
>>>
>>>
>>> The snippet of ignite configuration is below:
>>>
>>>
>>> <property name="peerClassLoadingEnabled" value="true"/>
>>>
>>> <property name="dataStorageConfiguration">
>>>     <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
>>>         <!-- Enable metrics for Ignite persistence -->
>>>         <property name="metricsEnabled" value="true"/>
>>>
>>>         <property name="defaultDataRegionConfiguration">
>>>             <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
>>>                 <property name="name" value="Default_Region"/>
>>>                 <property name="initialSize" value="#{32L * 1024 * 1024 * 1024}"/>
>>>                 <property name="maxSize" value="#{64L * 1024 * 1024 * 1024}"/>
>>>
>>>                 <!-- Enabling Apache Ignite Persistent Store. -->
>>>                 <property name="persistenceEnabled" value="true"/>
>>>
>>>                 <!-- Enable metrics for this data region -->
>>>                 <property name="metricsEnabled" value="true"/>
>>>             </bean>
>>>         </property>
>>>
>>>         <property name="storagePath" value="/opt/ignite/persistence/"/>
>>>         <property name="walPath" value="/opt/ignite/wal/"/>
>>>     </bean>
>>> </property>
>>>
>>>
>>> Ignite JVM configuration:  -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch
>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC
>>>
>>>
>>> Thanks
>>>
>>> radha
>>>
>>>
>>>
>>>
