Kubernetes discovery with readinessProbe

2018-02-13 Thread Bryan Rosander
We're using Apache Ignite in Kubernetes as a caching layer right now and
it's working well for us so far.

One thing that's been problematic is that when defining a readinessProbe,
the TcpDiscoveryKubernetesIpFinder only sees pods already in a ready state.

This means that if you start all your pods at once (as part of deployment
creation) you'll wind up with each pod being in its own grid.
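For context, the kind of readinessProbe involved looks roughly like this (an illustrative sketch; the HTTP path and port assume the ignite-rest-http module is enabled, and the values are not from our actual deployment):

```yaml
readinessProbe:
  httpGet:
    # Ignite's REST endpoint, if ignite-rest-http is on the classpath
    path: /ignite?cmd=version
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```

Until this probe passes, the pod's IP is withheld from the service endpoints, which is exactly why TcpDiscoveryKubernetesIpFinder can't see the other starting pods.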

It seems like others are experiencing this problem and I think I have a
workaround:
https://stackoverflow.com/questions/45176143/ignite-readinessprobe/48773865#48773865

The basic premise is to kill any pod that connects to a grid that doesn't
contain the alphabetically first IP in the service list.
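The decision logic of that workaround can be sketched as a small helper (hypothetical class and method names; the linked answer implements this as a probe script, so this only illustrates the comparison, not the actual mechanism):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper illustrating the workaround's decision logic:
// a pod should kill itself if the grid it joined does not contain the
// alphabetically first IP from the Kubernetes service endpoint list.
class SplitBrainCheck {
    /**
     * @param gridMemberIps IPs of the nodes in the grid this pod joined
     * @param serviceIps    all pod IPs currently behind the service
     * @return true if this pod should terminate and try to rejoin
     */
    static boolean shouldTerminate(Collection<String> gridMemberIps, List<String> serviceIps) {
        if (serviceIps.isEmpty())
            return false; // nothing to compare against yet

        // Lexicographic minimum = the "alphabetically first" IP.
        String first = Collections.min(serviceIps);

        return !gridMemberIps.contains(first);
    }
}
```

If every segmented grid applies this rule, only the grid containing the alphabetically first IP survives; the killed pods restart and rejoin it.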

I was wondering if there is a better solution to this segmentation problem
or if this seems workable.

Thanks,
Bryan


SSL Exception

2018-02-27 Thread Bryan Rosander
We're using Ignite in a 3-node grid with SSL and just hit an issue: after
a period of time (hours after starting), 2 of the 3 nodes seem to have lost
connectivity, and we see the following stack trace over and over.

The cluster starts up fine, so I doubt it's an issue with the certificates
or keystores.  Also, bouncing the Ignite instances seems to have "fixed"
it.  Any ideas as to what could have happened?

Thanks,
Bryan

2018-02-27 14:52:36,071 INFO  [grid-nio-worker-tcp-comm-2-#27]
o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication
connection [locAddr=/100.96.3.72:47100, rmtAddr=/100.96.6.183:45484]
2018-02-27 14:52:37,072 ERROR [grid-nio-worker-tcp-comm-2-#27]
o.a.i.s.c.tcp.TcpCommunicationSpi - Failed to process selector key
[ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker
[super=AbstractNioClientWorker [idx=2, bytesRcvd=17479234, bytesSent=0,
bytesRcvd0=2536, bytesSent0=0, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
hashCode=1854311052, interrupted=false,
runner=grid-nio-worker-tcp-comm-2-#27]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=10 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/
100.96.3.72:47100, rmtAddr=/100.96.6.183:45484, createTime=1519743156030,
closeTime=0, bytesSent=2448, bytesRcvd=2536, bytesSent0=2448,
bytesRcvd0=2536, sndSchedTime=1519743156071, lastSndTime=1519743156071,
lastRcvTime=1519743156071, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@497350a6, directMode=true],
GridConnectionBytesVerifyFilter, SSL filter], accepted=true]]]
javax.net.ssl.SSLException: Failed to encrypt data (SSL engine error)
[status=CLOSED, handshakeStatus=NEED_UNWRAP, ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=17479234, bytesSent=0, bytesRcvd0=2536, bytesSent0=0,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-2,
igniteInstanceName=null, finished=false, hashCode=1854311052,
interrupted=false, runner=grid-nio-worker-tcp-comm-2-#27]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=10 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/
100.96.3.72:47100, rmtAddr=/100.96.6.183:45484, createTime=1519743156030,
closeTime=0, bytesSent=2448, bytesRcvd=2536, bytesSent0=2448,
bytesRcvd0=2536, sndSchedTime=1519743156071, lastSndTime=1519743156071,
lastRcvTime=1519743156071, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=org.apache.ignite.internal.util.nio.GridDirectParser@497350a6,
directMode=true], GridConnectionBytesVerifyFilter, SSL filter],
accepted=true]]]
at org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.encrypt(GridNioSslHandler.java:379)
at org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.encrypt(GridNioSslFilter.java:270)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWriteSsl(GridNioServer.java:1418)
at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite(GridNioServer.java:1287)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2275)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2048)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1717)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:748)
2018-02-27 14:52:37,072 WARN  [grid-nio-worker-tcp-comm-2-#27]
o.a.i.s.c.tcp.TcpCommunicationSpi - Closing NIO session because of
unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
msg=Failed to encrypt data (SSL engine error) [status=CLOSED,
handshakeStatus=NEED_UNWRAP, ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=17479234, bytesSent=0, bytesRcvd0=2536, bytesSent0=0,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-2,
igniteInstanceName=null, finished=false, hashCode=1854311052,
interrupted=false, runner=grid-nio-worker-tcp-comm-2-#27]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=10 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/
100.96.3.72:47100, rmtAddr=/100.96.6.183:45484, createTime=1519743156030,
closeTime=0, bytesSent=2448, bytesRcvd=2536, bytesSent0=2448,
bytesRcvd0=2536, sndSchedTime=1519743156071, lastSndTime=1519743156071,
lastRcvTime=1519743156071, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParse

Re: SSL Exception

2018-02-27 Thread Bryan Rosander
Also, this is Ignite 2.3.0. Please let me know if there's any more
information I can provide.


Re: SSL Exception

2018-02-27 Thread Bryan Rosander
Hi Ilya,

It looks like that error corresponds to restarts of the particular pods
we're running.  We're currently running in Kubernetes as a StatefulSet.

I think it has to do with the node coming back up with the same address and
hostname but a different identifier.  I see this in the logs:
Caused by:
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$HandshakeException:
Remote node ID is not as expected
[expected=704fa7c2-bb6a-44bb-89c6-06722a3abac8,
rcvd=922d993c-6b08-4bee-92f2-130e108e3657]

After manually setting consistentId in the configuration, it seems that I
can bounce the pods at will without hitting this issue. I'll follow up if
we see it again.
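For anyone hitting the same thing, here is a minimal sketch of that setting in Ignite's Spring XML configuration (the SpEL expression reading the pod hostname is an assumption for StatefulSets; any value that stays stable across pod restarts works):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- In a StatefulSet the pod hostname is stable across restarts,
         so it makes a natural consistentId. -->
    <property name="consistentId" value="#{systemEnvironment['HOSTNAME']}"/>
</bean>
```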

Thanks,
Bryan

On Tue, Feb 27, 2018 at 11:14 AM, Ilya Kasnacheev  wrote:

> Hello Bryan!
>
> 2nd attempt to send this mail.
>
>
> Can you search in the log prior to the first problematic "Accepted
> incoming communication connection"? I assume there was a communication
> connection already back when the node was started, and you should look why
> it was closed in the first place. That might provide clues.
>
> Also, logs from remote node (one that makes those connections) at the same
> time might provide clues.
>
> Don't hesitate to provide full node logs.
>
> Regards,
>
>
> --
> Ilya Kasnacheev

NPE When joining grid

2018-06-27 Thread Bryan Rosander
Hey all,

I was wondering if anyone else has seen NPEs while joining a grid with
Ignite 2.4.0 (a quick search didn't show anything in Jira).

This is happening in our K8s cluster where the grid is rolled for every CI
deploy.

2018-06-27 18:00:59,869 INFO  [exchange-worker-#42]
o.a.ignite.internal.exchange.time - Finished exchange init
[topVer=AffinityTopologyVersion [topVer=264, minorTopVer=0], crd=false]
2018-06-27 18:01:00,072 INFO  [sys-#44]
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Received full message,
will finish exchange [node=7dabcf2e-f30f-43e7-8364-32760205c3f1,
resVer=AffinityTopologyVersion [topVer=265, minorTopVer=0]]
2018-06-27 18:01:00,075 ERROR [sys-#44]
o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Failed to notify
listener:
o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$5@be0660
java.lang.NullPointerException: null
at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$13.applyx(CacheAffinitySharedManager.java:1335)
at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$13.applyx(CacheAffinitySharedManager.java:1327)
at org.apache.ignite.internal.util.lang.IgniteInClosureX.apply(IgniteInClosureX.java:38)
at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllCacheGroups(CacheAffinitySharedManager.java:1115)
at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1327)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:2941)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$1400(GridDhtPartitionsExchangeFuture.java:124)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$5.apply(GridDhtPartitionsExchangeFuture.java:2684)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$5.apply(GridDhtPartitionsExchangeFuture.java:2672)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveFullMessage(GridDhtPartitionsExchangeFuture.java:2672)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processFullPartitionUpdate(GridCachePartitionExchangeManager.java:1481)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1100(GridCachePartitionExchangeManager.java:133)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$3.onMessage(GridCachePartitionExchangeManager.java:339)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$3.onMessage(GridCachePartitionExchangeManager.java:337)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2689)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2668)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1060)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:378)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:304)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:99)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:293)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1555)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1183)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1090)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Thanks,
Bryan