Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up
If backup partitions are available when a node is lost, we should not expect lost partitions. There is a lot more to this story than this thread explains, so for the community: please don't follow this procedure.

https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy

"A partition is lost when both the primary copy and all backup copies of the partition are not available to the cluster, i.e. when the primary and backup nodes for the partition become unavailable."

If you attempt to access a cache and receive a lost-partitions error, this means there IS DATA LOSS. Partition loss means no primary or backup copy of that cache partition is available anywhere in the cluster. Have multiple server nodes experienced trouble? Can we be certain that the affected caches were created with backups >= 1?

If a node fails to start up and complains about maintenance tasks, we should be very suspicious that this node's persistent data is corrupted. If the cluster is activated with the node missing and caches report lost partitions, then we know those caches have lost some data. If there are no lost partitions, we can safely remove the corrupted node from the baseline, bring up a fresh node, and add it to the baseline as a replacement, thus restoring redundancy. If there are lost partitions and we need to reset them to bring a cache back online, we should expect that cache to be missing some data and plan to reload it.

Cache configuration backups=2 is excessive except in edge cases. For backups=n, the memory and persistence footprint is n+1 times the nominal data footprint, and that cost scales linearly. The marginal utility of each additional backup copy is diminishing: assuming independent failures with per-node failure probability p, a partition is lost only when its primary and all n backups fail at the same time, which happens with probability roughly p^(n+1). Each extra backup multiplies an already small risk by another factor of p while adding a full extra copy of the data.

Think of backup partitions like multiple coats of paint.
After the second coat, nobody will be able to tell whether you applied a third or fourth coat, yet each coat still takes the same effort and materials.

If you NEED fault tolerance, then it should be mandatory to test that the configuration you have chosen works as expected. With backups=1 we should expect a cache to keep working, without complaining about lost partitions, while one server node is offline. If backups=1 isn't effective for single-node failures, then backups=2 will make no beneficial difference.

On Wed, May 29, 2024 at 12:15 PM Naveen Kumar wrote:

> Thanks very much for your prompt response Gianluca
>
> just for the community, I could solve this by running control.sh with
> reset_lost_partitions for the individual caches.
> looks like it worked, the partition issue is resolved, I suppose there
> wouldn't be any data loss as we have set all our caches with 2 replicas
>
> coming to the node which was not getting added to the cluster earlier,
> removed from baseline --> cleared all persistence store --> brought up the
> node --> added the node to baseline, this also seems to have worked fine.
>
> Thanks
>
> On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti <
> gianluca.bone...@gmail.com> wrote:
>
>> Hello Naveen
>>
>> Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact.
>> Three bugfix releases have been rolled out since, up to the 2.16 release.
>>
>> It seems you are restarting your cluster on a regular basis, so you'd
>> better upgrade to 2.16 as soon as possible.
>> Otherwise it will also be very difficult for people on a community-based
>> mailing list, on volunteer time, to work out a solution with a 2-year-old
>> version running.
>>
>> Besides that, you are not providing very much information about your
>> cluster setup.
>> How many nodes, what infrastructure, how many caches, overall data size.
>> One could only guess you have more than 1 node running, with at least 1
>> cache, and a non-empty dataset. :)
>>
>> This document from GridGain may be helpful; I don't see the same for
>> Ignite, but it may still be worth checking out.
>>
>> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>>
>> On the other hand you should also check your failing node.
>> If it is always the same node failing, then there should be some root
>> cause apart from Ignite.
>> Indeed if the node configuration is the same across all nodes, and just
>> this one fails, you should also consider network issues (check
>> connectivity and network latency between nodes) and hardware-related
>> issues (faulty disks, faulty memory).
>> In the end, one option might be to replace the faulty machine with a
>> brand new one.
>> In cloud environments this is actually quite cheap and easy to do.
>>
>> Cheers
>> Gianluca
>>
>> On Wed, 29 May 2024 at 08:43, Naveen Kumar wrote:
>>
>>> Hello All
>>>
>>> We are using Ignite 2.13.0
>>>
>>> After a cluster restart, one of the nodes is not coming up and in the
>>> node logs we are seeing this error - Node requires maintenance, non-empty
>>> set of maintenance tasks is found - node is not coming up
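To put rough numbers on the diminishing-returns argument above, here is a back-of-the-envelope sketch. It assumes independent node failures with a per-node failure probability p, which real clusters (correlated failures, shared racks or disks) only approximate:

```shell
# Probability that a given partition is lost: the primary and all n
# backup copies must be unavailable at once, i.e. roughly p^(n+1)
# under the independence assumption.
loss_prob() { awk -v p="$1" -v n="$2" 'BEGIN { print p ^ (n + 1) }'; }

loss_prob 0.01 0   # no backups:  p^1 = 0.01
loss_prob 0.01 1   # backups=1:   p^2 = 0.0001
loss_prob 0.01 2   # backups=2:   p^3 = 0.000001, for 50% more storage than backups=1
```

Each extra backup buys another factor of p in safety but a full extra copy of the data in storage, which is the "coats of paint" point in numbers.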
Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up
Thanks very much for your prompt response Gianluca

just for the community, I could solve this by running control.sh with reset_lost_partitions for the individual caches. It looks like it worked and the partition issue is resolved. I suppose there wouldn't be any data loss as we have set all our caches with 2 replicas.

coming to the node which was not getting added to the cluster earlier: removed it from the baseline --> cleared all persistence store --> brought up the node --> added the node to the baseline. This also seems to have worked fine.

Thanks

On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti wrote:

> Hello Naveen
>
> Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact.
> Three bugfix releases have been rolled out since, up to the 2.16 release.
>
> It seems you are restarting your cluster on a regular basis, so you'd
> better upgrade to 2.16 as soon as possible.
> Otherwise it will also be very difficult for people on a community-based
> mailing list, on volunteer time, to work out a solution with a 2-year-old
> version running.
>
> Besides that, you are not providing very much information about your
> cluster setup.
> How many nodes, what infrastructure, how many caches, overall data size.
> One could only guess you have more than 1 node running, with at least 1
> cache, and a non-empty dataset. :)
>
> This document from GridGain may be helpful; I don't see the same for
> Ignite, but it may still be worth checking out.
>
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>
> On the other hand you should also check your failing node.
> If it is always the same node failing, then there should be some root
> cause apart from Ignite.
> Indeed if the node configuration is the same across all nodes, and just
> this one fails, you should also consider network issues (check
> connectivity and network latency between nodes) and hardware-related
> issues (faulty disks, faulty memory).
> In the end, one option might be to replace the faulty machine with a
> brand new one.
> In cloud environments this is actually quite cheap and easy to do.
>
> Cheers
> Gianluca
>
> On Wed, 29 May 2024 at 08:43, Naveen Kumar wrote:
>
>> Hello All
>>
>> We are using Ignite 2.13.0
>>
>> After a cluster restart, one of the nodes is not coming up and in the
>> node logs we are seeing this error - Node requires maintenance, non-empty
>> set of maintenance tasks is found - node is not coming up
>>
>> we are also getting errors like "timeout is reached before computation
>> is completed" on other nodes.
>>
>> I could see that we have the control.sh script to back up and clean up
>> the corrupted files, but when I run the command, it fails.
>>
>> I have removed the node from the baseline and tried to run it as well;
>> still it's failing.
>>
>> what could be the solution for this? the cluster is functioning,
>> however there are requests failing
>>
>> Is there any way we can start the Ignite node in maintenance mode and
>> try running the clean corrupted commands?
>>
>> Thanks
>> Naveen

--
Thanks & Regards, Naveen Bandaru
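For anyone finding this thread later, the two recovery paths described above map onto control.sh roughly as follows. This is a sketch, not a tested runbook: the cache names, consistent ID, and work directory are placeholders, and (per the warning elsewhere in this thread) reset_lost_partitions does not restore lost data, it only marks the partitions usable again.

```shell
# Dry-run sketch: each command is echoed rather than executed.
# Replace the 'run' shim with direct execution once the steps are verified.
run() { echo "+ $*"; }

# 1. Resetting lost partitions for specific caches (comma-separated list;
#    cache names here are made up for illustration):
run control.sh --cache reset_lost_partitions myCache1,myCache2

# 2. Replacing a node whose persistence files are corrupted:
run control.sh --baseline remove nodeConsistentId   # drop it from the baseline
run rm -rf /opt/ignite/work/db/nodeConsistentId     # wipe its local persistence (placeholder path)
# ...start a fresh node and wait for it to join the topology, then:
run control.sh --baseline add nodeConsistentId      # rebalancing restores redundancy
```

Running `control.sh --cache idle_verify` before and after is a cheap way to confirm partition health once the cluster is quiet.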
Apache Ignite - Load balancer nodes standalone
I installed 4 Apache Ignite nodes on 4 virtual machines in a VMware cluster. One question I have is: how do I connect to the cluster? Do I connect via the IP of any node, or do I need to put nginx "in front" to do the load balancing? I didn't find anything about this in the official documentation.

Eduardo Henrique Paludo
Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up
Hello Naveen

Apache Ignite 2.13 is more than 2 years old, 25 months old in actual fact. Three bugfix releases have been rolled out since, up to the 2.16 release.

It seems you are restarting your cluster on a regular basis, so you'd better upgrade to 2.16 as soon as possible. Otherwise it will also be very difficult for people on a community-based mailing list, on volunteer time, to work out a solution with a 2-year-old version running.

Besides that, you are not providing very much information about your cluster setup. How many nodes, what infrastructure, how many caches, overall data size? One could only guess you have more than 1 node running, with at least 1 cache, and a non-empty dataset. :)

This document from GridGain may be helpful; I don't see the same for Ignite, but it may still be worth checking out.

https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode

On the other hand you should also check your failing node. If it is always the same node failing, then there should be some root cause apart from Ignite. Indeed, if the node configuration is the same across all nodes and just this one fails, you should also consider network issues (check connectivity and network latency between nodes) and hardware-related issues (faulty disks, faulty memory). In the end, one option might be to replace the faulty machine with a brand new one. In cloud environments this is actually quite cheap and easy to do.

Cheers
Gianluca

On Wed, 29 May 2024 at 08:43, Naveen Kumar wrote:

> Hello All
>
> We are using Ignite 2.13.0
>
> After a cluster restart, one of the nodes is not coming up and in the
> node logs we are seeing this error - Node requires maintenance, non-empty
> set of maintenance tasks is found - node is not coming up
>
> we are also getting errors like "timeout is reached before computation is
> completed" on other nodes.
>
> I could see that we have the control.sh script to back up and clean up
> the corrupted files, but when I run the command, it fails.
>
> I have removed the node from the baseline and tried to run it as well;
> still it's failing.
>
> what could be the solution for this? the cluster is functioning, however
> there are requests failing
>
> Is there any way we can start the Ignite node in maintenance mode and try
> running the clean corrupted commands?
>
> Thanks
> Naveen
Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up
Hello All

We are using Ignite 2.13.0.

After a cluster restart, one of the nodes is not coming up, and in the node logs we are seeing this error - Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up.

We are also getting errors like "timeout is reached before computation is completed" on other nodes.

I could see that we have the control.sh script to back up and clean up the corrupted files, but when I run the command, it fails.

I have removed the node from the baseline and tried to run it as well; still it's failing.

What could be the solution for this? The cluster is functioning; however, there are requests failing.

Is there any way we can start the Ignite node in maintenance mode and try running the clean corrupted commands?

Thanks
Naveen
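[Editor's note] The maintenance-mode cleanup asked about here is driven through control.sh's persistence subcommands. A hedged sketch follows; the exact subcommand names vary by Ignite version (they appear in recent 2.x releases), so verify against `control.sh --help` on your build before relying on them:

```shell
# Dry-run sketch: commands are echoed, not executed. Run them against
# the node that entered maintenance mode, not the rest of the cluster.
run() { echo "+ $*"; }

run control.sh --persistence info               # list cache data files flagged as corrupted
run control.sh --persistence backup corrupted   # copy the corrupted files aside first
run control.sh --persistence clean corrupted    # then remove them so the node can start and rejoin
```

After the cleanup, the caches whose files were removed will need their data rebalanced from other nodes or reloaded.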