Re: GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-12-04 Thread Ilya Kasnacheev
Hello!

It might be safe to remove that file and restart the node.
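
Something along these lines, for example (just a sketch - the WAL path
below is hypothetical, substitute the wal directory your node actually
writes to, and back the file up first):

using System;
using System.IO;

// Hypothetical WAL segment path -- substitute the wal directory under
// your node's storage path.
var walFile = @"C:\ignite\work\db\wal\node00\0000000000000008.wal";
var info = new FileInfo(walFile);

// Only a zero-byte segment is a candidate for removal; keep anything
// partially written for analysis.
if (info.Exists && info.Length == 0)
{
    File.Copy(walFile, walFile + ".bak", overwrite: false);
    File.Delete(walFile);
    Console.WriteLine("Removed empty WAL segment; restart the node.");
}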

Regards,
-- 
Ilya Kasnacheev



Re: GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-12-03 Thread Raymond Wilson
Hi Ilya,

I checked the state of the WAL file in question (0008.wal) - it
is a zero-byte WAL file. The only other WAL file present in the same
location is .wal (65536 KB in size), which seems odd, as
WAL files 0001.wal through 0007.wal are not present.
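
(A quick way to list the segments and their sizes, in case anyone wants
to reproduce the check - the directory path here is illustrative:)

using System;
using System.IO;
using System.Linq;

// List WAL segments and their sizes; substitute your node's wal folder.
var walDir = @"C:\ignite\work\db\wal\node00";

foreach (var f in new DirectoryInfo(walDir).GetFiles("*.wal")
                                           .OrderBy(f => f.Name))
{
    Console.WriteLine($"{f.Name}: {f.Length} bytes");
}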

Thanks,
Raymond.


Re: GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-12-03 Thread Ilya Kasnacheev
Hello!

It seems that the WAL file got truncated or corrupted somehow.

Can you upload this file to some file storage so that we can take a look?

You can also try changing this node's WAL mode to LOG_ONLY and starting it
again (after backing up your data, of course). The consistency checks are
less strict in that mode.
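
In the .NET configuration the change would look roughly like this (a
minimal sketch only - the region name and the rest of the configuration
are illustrative):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

// Sketch: start the persistent node with the relaxed LOG_ONLY WAL mode.
var cfg = new IgniteConfiguration
{
    DataStorageConfiguration = new DataStorageConfiguration
    {
        WalMode = WalMode.LogOnly,  // instead of WalMode.Fsync
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "persistent-region",
            PersistenceEnabled = true
        }
    }
};

using var ignite = Ignition.Start(cfg);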

Regards,
-- 
Ilya Kasnacheev



Re: GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-11-30 Thread Raymond Wilson
Hi Ilya,

We don’t change the WAL segment size from its default value.

The only activity that occurred was stopping a node, making a minor change
(not related to persistence) and re-running the node.

Raymond.

Sent from my iPhone


Re: GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-11-30 Thread Ilya Kasnacheev
Hello!

"WAL segment size change is not supported"

Is there a chance that you have changed WAL segment size setting between
launches?
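
In the .NET client that setting is DataStorageConfiguration.WalSegmentSize
(a sketch; 64 MB is the default, and the value must match the segments
already on disk):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

// Sketch: the WAL segment size is fixed once segment files exist on disk,
// so this value must not change between launches of the same node.
var cfg = new IgniteConfiguration
{
    DataStorageConfiguration = new DataStorageConfiguration
    {
        WalSegmentSize = 64 * 1024 * 1024  // bytes; 64 MB is the default
    }
};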

Regards,
-- 
Ilya Kasnacheev



GridProcessorAdapter fails to start due to failure to initialise WAL segment on Ignite startup

2018-11-28 Thread Raymond Wilson
I'm using Ignite 2.6 with the C# client.

I have a running cluster that I was debugging. All requests were read only
(there were no state-mutating operations running in the cluster).

I terminated the single server node in the grid (running in the debugger)
to make a small code change and re-run it (I do this frequently). The node
may have been stopped for longer than the partitioning timeout.

On re-running the server node it failed to start. On re-running the
complete cluster it still failed to start, and all other nodes reported
failure to connect to an inactive grid.

Looking at the log for the failing server node, I get the following
output showing an exception while initializing a WAL segment. This failure
seems permanent and is unexpected, as we are using the strict WAL atomicity
mode (WalMode.Fsync) for all persisted regions. Is this a recoverable error,
or does this imply data loss? [NB: This is a dev system, so no prod data is
affected]
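
For context, our data storage is configured roughly as follows (a
simplified sketch reconstructed from the configuration dump below):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

// Simplified sketch; the values match the IgniteConfiguration dump below.
var cfg = new IgniteConfiguration
{
    IgniteInstanceName = "TRex-Immutable",
    DataStorageConfiguration = new DataStorageConfiguration
    {
        WalMode = WalMode.Fsync,  // strict WAL atomicity
        PageSize = 16384,
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "Default-Immutable",
            InitialSize = 134217728,  // 128 MB
            MaxSize = 1073741824,     // 1 GB
            PersistenceEnabled = true
        }
    }
};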


2018-11-29 12:26:09,933 [1] INFO  ImmutableCacheComputeServer
>>> [Apache Ignite ASCII art banner]
>>> ver. 2.6.0#20180710-sha1:669feacc
>>> 2018 Copyright(C) Apache Software Foundation
>>> Ignite documentation: http://ignite.apache.org
2018-11-29 12:26:09,933 [1] INFO  ImmutableCacheComputeServer Config URL:
n/a
2018-11-29 12:26:09,948 [1] INFO  ImmutableCacheComputeServer
IgniteConfiguration [igniteInstanceName=TRex-Immutable, pubPoolSize=50,
svcPoolSize=12, callbackPoolSize=12, stripedPoolSize=12, sysPoolSize=12,
mgmtPoolSize=4, igfsPoolSize=12, dataStreamerPoolSize=12,
utilityCachePoolSize=12, utilityCacheKeepAliveTime=6, p2pPoolSize=2,
qryPoolSize=12, igniteHome=null,
igniteWorkDir=C:\Users\rwilson\AppData\Local\Temp\TRexIgniteData\Immutable,
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6e4784bc,
nodeId=8f32d0a6-539c-40dd-bc42-d044f28bac73,
marsh=org.apache.ignite.internal.binary.BinaryMarshaller@e4487af,
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000,
sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=1,
metricsUpdateFreq=2000, metricsExpTime=9223372036854775807,
discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=5000,
ackTimeout=5000, marsh=null, reconCnt=10, reconDelay=2000,
maxAckTimeout=60, forceSrvMode=false, clientReconnectDisabled=false,
internalLsnr=null], segPlc=STOP, segResolveAttempts=2,
waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=1,
commSpi=TcpCommunicationSpi [connectGate=null, connPlc=null,
enableForcibleNodeKill=false, enableTroubleshootingLog=false,
srvLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$2@10d68fcd,
locAddr=127.0.0.1, locHost=null, locPort=47100, locPortRange=100,
shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=3,
connTimeout=5000, maxConnTimeout=60, reconCnt=10, sockSndBuf=32768,
sockRcvBuf=32768, msgQueueLimit=1024, slowClientQueueLimit=0, nioSrvr=null,
shmemSrv=null, usePairedConnections=false, connectionsPerNode=1,
tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=16,
unackedMsgsBufSize=0, sockWriteTimeout=2000, lsnr=null, boundTcpPort=-1,
boundTcpShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@117e949d[Count = 1],
stopping=false,
metricsLsnr=org.apache.ignite.spi.communication.tcp.TcpCommunicationMetricsListener@6db9f5a4],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@5f8edcc5,
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [lsnr=null],
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@7a675056,
addrRslvr=null, clientMode=false, rebalanceThreadPoolSize=1,
txCfg=org.apache.ignite.configuration.TransactionConfiguration@d21a74c,
cacheSanityCheckEnabled=true, discoStartupDelay=6, deployMode=SHARED,
p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100,
timeSrvPortRange=100, failureDetectionTimeout=1,
clientFailureDetectionTimeout=3, metricsLogFreq=1, hadoopCfg=null,
connectorCfg=org.apache.ignite.configuration.ConnectorConfiguration@6e509ffa,
odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration
[seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null,
grpName=null], classLdr=null, sslCtxFactory=null,
platformCfg=PlatformDotNetConfiguration [binaryCfg=null],
binaryCfg=BinaryConfiguration [idMapper=null, nameMapper=null,
serializer=null, compactFooter=true], memCfg=null, pstCfg=null,
dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040,
sysCacheMaxSize=104857600, pageSize=16384, concLvl=0,
dfltDataRegConf=DataRegionConfiguration [name=Default-Immutable,
maxSize=1073741824, initSize=134217728, swapPath=null,
pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100,
metricsEnabled=false, metricsSubIntervalCount=5,
metricsRateTimeInterval=6, persistenceEnabled=true,
checkpointPageBufSize=0],
storagePath=/persist\TRexIgniteData\Imm