failureDetectionTimeout - 60000
joinTimeout - 120000

I saw these recommendations in one of the answers on your forum.
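
For anyone who finds this thread later, here is a minimal sketch of where these two settings could go in a Spring XML config such as ignite-config-server.xml. Both values are in milliseconds. The bean layout below is an assumption, since the attached file is not reproduced in this thread, and the existing IP finder is left as a placeholder comment rather than guessed at.

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Failure detection timeout, in milliseconds (value from this thread). -->
    <property name="failureDetectionTimeout" value="60000"/>

    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <!-- Join timeout, in milliseconds (value from this thread). -->
            <property name="joinTimeout" value="120000"/>

            <!-- Assumption: keep whatever ipFinder property your config already
                 defines here (for Kubernetes, typically TcpDiscoveryKubernetesIpFinder). -->
        </bean>
    </property>
</bean>
```

These are the values that worked in our AKS environment; other clusters may need different ones.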

On Mon, Jan 14, 2019 at 2:21 PM Stephen Darlington
<stephen.darling...@gridgain.com> wrote:

> Glad you managed to resolve it. What did you have to increase the values
> to?
>
> Regards,
> Stephen
>
> On 14 Jan 2019, at 09:34, Alena Laas <alena.l...@cbsinteractive.com>
> wrote:
>
> It seems that increasing joinTimeout and failureDetectionTimeout solved
> the problem.
>
> On Fri, Jan 11, 2019 at 5:24 PM Alena Laas <alena.l...@cbsinteractive.com>
> wrote:
>
>> I attached part of the log with "node failed" events (100.99.129.141 is
>> the IP of the restarted node).
>>
>> These events repeat until, suddenly, after about 40 minutes to an hour,
>> the node connects to the cluster.
>>
>> Could you explain why this is happening?
>>
>> On Thu, Jan 10, 2019 at 7:54 PM Alena Laas <alena.l...@cbsinteractive.com>
>> wrote:
>>
>>> We are using an Azure AKS cluster.
>>>
>>> We kill the pod using the Kubernetes dashboard or through kubectl
>>> (kubectl delete pods <name>); either way, the result is the same.
>>>
>>> Do you need any more logs from us?
>>>
>>> On Thu, Jan 10, 2019 at 7:28 PM Stephen Darlington <
>>> stephen.darling...@gridgain.com> wrote:
>>>
>>>> What kind of environment are you using? A public cloud? Your own data
>>>> centre? And how are you killing the pod?
>>>>
>>>> I fired up a cluster using Minikube and your configuration and it
>>>> worked as far as I could see. (I deleted the pod using the dashboard, for
>>>> what that’s worth.)
>>>>
>>>> Regards,
>>>> Stephen
>>>>
>>>> On 10 Jan 2019, at 14:20, Alena Laas <alena.l...@cbsinteractive.com>
>>>> wrote:
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Alena Laas <alena.l...@cbsinteractive.com>
>>>> Date: Thu, Jan 10, 2019 at 5:13 PM
>>>> Subject: Ignite in Kubernetes not works correctly
>>>> To: <user@ignite.apache.org>
>>>> Cc: Vadim Shcherbakov <vadim.shcherba...@cbsinteractive.com>
>>>>
>>>> Hello!
>>>> Could you please help with a problem with Ignite in a Kubernetes
>>>> cluster?
>>>>
>>>> When we start 2 Ignite nodes at the same time, or scale the Deployment
>>>> from 1 to 2, everything is fine: both nodes are visible in the Ignite
>>>> cluster (we use the web console to check).
>>>>
>>>> But after we kill the pod running one node and it restarts, the node is
>>>> no longer visible in the Ignite cluster. Moreover, the log from the
>>>> restarted node is sparse:
>>>> [13:32:57]    __________  ________________
>>>> [13:32:57]   /  _/ ___/ |/ /  _/_  __/ __/
>>>> [13:32:57]  _/ // (7 7    // /  / / / _/
>>>> [13:32:57] /___/\___/_/|_/___/ /_/ /___/
>>>> [13:32:57]
>>>> [13:32:57] ver. 2.7.0#20181130-sha1:256ae401
>>>> [13:32:57] 2018 Copyright(C) Apache Software Foundation
>>>> [13:32:57]
>>>> [13:32:57] Ignite documentation: http://ignite.apache.org
>>>> [13:32:57]
>>>> [13:32:57] Quiet mode.
>>>> [13:32:57] ^-- Logging to file
>>>> '/opt/ignite/apache-ignite/work/log/ignite-7d323675.0.log'
>>>> [13:32:57] ^-- Logging by 'JavaLogger [quiet=true, config=null]'
>>>> [13:32:57] ^-- To see **FULL** console log here add
>>>> -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
>>>> [13:32:57]
>>>> [13:32:57] OS: Linux 4.15.0-1036-azure amd64
>>>> [13:32:57] VM information: OpenJDK Runtime Environment 1.8.0_181-b13
>>>> Oracle Corporation OpenJDK 64-Bit Server VM 25.181-b13
>>>> [13:32:57] Please set system property '-Djava.net.preferIPv4Stack=true'
>>>> to avoid possible problems in mixed environments.
>>>> [13:32:57] Configured plugins:
>>>> [13:32:57] ^-- None
>>>> [13:32:57]
>>>> [13:32:57] Configured failure handler:
>>>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>> super=AbstractFailureHandler
>>>> [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]]]
>>>> [13:32:58] Message queue limit is set to 0 which may lead to potential
>>>> OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due
>>>> to message queues growth on sender and receiver sides.
>>>> [13:32:58] Security status [authentication=off, tls/ssl=off]
>>>>
>>>> And the log from the remaining node reports either 2 servers or 1,
>>>> flipping back and forth:
>>>> [14:02:05] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:02:15] Topology snapshot [ver=234, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:02:15] Topology snapshot [ver=235, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:02:20] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:02:30] Topology snapshot [ver=236, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:02:30] Topology snapshot [ver=237, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:02:35] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:02:45] Topology snapshot [ver=238, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:02:45] Topology snapshot [ver=239, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:02:50] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:03:00] Topology snapshot [ver=240, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:03:00] Topology snapshot [ver=241, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:03:06] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:03:16] Topology snapshot [ver=242, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:03:16] Topology snapshot [ver=243, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:03:21] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:03:31] Topology snapshot [ver=244, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:03:31] Topology snapshot [ver=245, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:03:36] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:03:46] Topology snapshot [ver=246, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:03:46] Topology snapshot [ver=247, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:03:51] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>> [14:04:01] Topology snapshot [ver=248, locNode=a5eb30e1, servers=2,
>>>> clients=0, state=ACTIVE, CPUs=16, offheap=40.0GB, heap=2.0GB]
>>>> [14:04:01] Topology snapshot [ver=249, locNode=a5eb30e1, servers=1,
>>>> clients=0, state=ACTIVE, CPUs=8, offheap=20.0GB, heap=1.0GB]
>>>> [14:04:06] Joining node doesn't have encryption data
>>>> [node=7d323675-bc0b-4507-affb-672b25766201]
>>>>
>>>> I am attaching our config file for the Ignite server and the yaml files
>>>> for Kubernetes. Everything there was done according to your official
>>>> documentation. The Ignite version we are trying now is 2.7.0.
>>>> Looking forward to getting an answer from you.
>>>>
>>>> --
>>>>
>>>> *ALENA LAAS*
>>>> SOFTWARE ENGINEER (JAVA)
>>>> CNET Content Solutions
>>>> OFFICE +7.495.967.1201 FAX +7.495.967.1203
>>>> 5 Letnikovskaya str., Moscow, Russia, 115114
>>>>
>>>> <ignite-config-server.xml><fcat-ignite-stage.yaml>