Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up

2024-05-29 Thread Jeremy McMillan
If backup partitions are available when a node is lost, we should not
expect lost partitions.

There is a lot more to this story than this thread explains, so for the
community: please don't follow this procedure.

https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy
"A partition is lost when both the primary copy and all backup copies of
the partition are not available to the cluster, i.e. when the primary and
backup nodes for the partition become unavailable."

If you attempt to access a cache and receive a lost partitions error, this
means there IS DATA LOSS. Partition loss means there are no primary or
backup copies of a particular cache partition available. Have multiple
server nodes experienced trouble? Can we be certain that the affected
caches were created with backups>=1?
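
For reference, this is roughly what the relevant settings look like through
the Java API (a sketch only; the cache name, key/value types, and the loss
policy chosen below are illustrative placeholders, not a claim about the
affected caches):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class BackupConfigExample {
        public static void main(String[] args) {
            // One primary + one backup copy of every partition; fail
            // operations on lost partitions instead of silently serving
            // partial data.
            CacheConfiguration<Long, String> ccfg =
                new CacheConfiguration<>("myCache");
            ccfg.setCacheMode(CacheMode.PARTITIONED);
            ccfg.setBackups(1);
            ccfg.setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

            try (Ignite ignite = Ignition.start(
                    new IgniteConfiguration().setCacheConfiguration(ccfg))) {
                // Confirm what a cache was actually created with.
                int backups = ignite.cache("myCache")
                    .getConfiguration(CacheConfiguration.class)
                    .getBackups();
                System.out.println("myCache backups = " + backups);
            }
        }
    }

The getConfiguration() call is the quickest way to answer the backups>=1
question against a running cluster.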

If a node fails to start up and complains about maintenance tasks, we
should be very suspicious that this node's persistent data is corrupted. If
the cluster is activated with a missing node and caches have lost
partitions, then we know those caches have lost some data. If there are no
lost partitions, we can safely remove the corrupted node from the baseline,
bring up a fresh node, and add it to the baseline to replace it, thus
restoring redundancy. If there are lost partitions and we need to reset
lost partitions to bring a cache back online, we should expect that cache
is missing some data and may need to be reloaded.
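
For anyone scripting that decision, a rough programmatic sketch of the
check is below (the cache name and the client configuration path are
placeholders; the control.sh reset_lost_partitions command mentioned
elsewhere in this thread performs the same reset from the command line):

    import java.util.Collection;
    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class LostPartitionCheck {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                IgniteCache<Object, Object> cache = ignite.cache("myCache");

                // Partitions that have no live primary or backup copy.
                Collection<Integer> lost = cache.lostPartitions();

                if (lost.isEmpty()) {
                    // No data loss: safe to adjust the baseline and let
                    // rebalancing restore redundancy.
                    System.out.println("No lost partitions for myCache");
                } else {
                    // Data loss: the reset only clears the lost state so the
                    // cache is usable again; missing entries must be reloaded.
                    System.out.println("Lost partitions: " + lost);
                    ignite.resetLostPartitions(Collections.singleton("myCache"));
                }
            }
        }
    }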

Cache configuration backups=2 is excessive except in edge cases. For
backups=n, the memory and persistence footprint is n+1 times the nominal
data footprint, so the cost scales linearly. The marginal utility of each
additional backup copy is diminishing: if the probability of any single
node failing is p, then losing a partition outright requires all n+1 copies
to be unavailable at once, which for independent failures happens with
probability on the order of p^(n+1).
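
To put illustrative numbers on that (the failure probability here is
assumed purely for the arithmetic, and real failures during a restart tend
to be correlated rather than independent): if a node is unavailable with
probability p = 0.01 at any given moment, then with backups=1 a partition
is lost only when both of its 2 copies are down at once, roughly
0.01^2 = 1 in 10,000. With backups=2 all 3 copies must be down, roughly
0.01^3 = 1 in 1,000,000. The second backup costs another full copy of the
data but only moves an already rare event to a rarer one.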

Think of backup partitions like multiple coats of paint. After the second
coat, nobody will be able to tell the difference if you applied a third or
fourth coat. It still takes the same effort and materials to apply each
coat of paint.

If you NEED fault tolerance, then it should be mandatory to conduct testing
to make sure the configuration you have chosen is working as expected. If
backups=1 isn't effective for single node failures, then backups=2 will
make no beneficial difference. With backups=1 we should expect a cache to
work without complaining about lost partitions when one server node is
offline.
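
If it helps, a minimal failover smoke test might look roughly like the
sketch below. It assumes a cluster with at least two server nodes, a cache
named "myCache" created with backups=1, and a client configuration file
"client-config.xml"; all of those names are placeholders:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class FailoverSmokeTest {
        public static void main(String[] args) throws Exception {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                IgniteCache<Integer, String> cache = ignite.cache("myCache");

                // Load a known data set.
                for (int i = 0; i < 10_000; i++)
                    cache.put(i, "value-" + i);

                System.out.println("Data loaded. Stop ONE server node, then press Enter.");
                System.in.read();

                // With backups=1, every read must still succeed and no
                // partitions may be reported lost after a single-node failure.
                for (int i = 0; i < 10_000; i++)
                    if (cache.get(i) == null)
                        throw new AssertionError("Missing key " + i);

                if (!cache.lostPartitions().isEmpty())
                    throw new AssertionError("Lost partitions: " + cache.lostPartitions());

                System.out.println("Single-node failure tolerated as expected.");
            }
        }
    }

Run it, stop exactly one server node when prompted, and every read should
succeed with no lost partitions; repeating the exercise with two nodes down
shows what backups=2 would, and would not, buy you.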

On Wed, May 29, 2024 at 12:15 PM Naveen Kumar 
wrote:

> Thanks very much for your prompt response Gianluca
>
> just for the community, I could solve this by running control.sh with
> reset lost partitions (reset_lost_partitions) for the individual caches.
> It looks like it worked, the partition issue is resolved. I suppose there
> wouldn't be any data loss as we have set all our caches with 2 replicas.
>
> coming to the node which was not getting added to the cluster earlier:
> removed it from the baseline --> cleared its persistence store --> brought up
> the node --> added it back to the baseline; this also seems to have worked
> fine.
>
> Thanks
>
>
> On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti <
> gianluca.bone...@gmail.com> wrote:
>
>> Hello Naveen
>>
>> Apache Ignite 2.13 is more than 2 years old, 25 months old in fact.
>> Three bugfix releases have been rolled out over time, up to the 2.16 release.
>>
>> It seems you are restarting your cluster on a regular basis, so you'd
>> better upgrade to 2.16 as soon as possible.
>> Otherwise it will also be very difficult for people on a community-based
>> mailing list, on volunteer time, to work out a solution with a 2-year-old
>> version running.
>>
>> Besides that, you are not providing very much information about your
>> cluster setup.
>> How many nodes, what infrastructure, how many caches, overall data size.
>> One could only guess you have more than 1 node running, with at least 1
>> cache, and non-empty dataset. :)
>>
>> This document from GridGain may be helpful; I don't see an equivalent for
>> Ignite, but it may still be worth checking out.
>>
>> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>>
>> On the other hand you should also check your failing node.
>> If it is always the same node failing, then there should be some root
>> cause apart from Ignite.
>> Indeed, if the node configuration is the same across all nodes and just
>> this one fails, you should also consider network issues (check
>> connectivity and network latency between nodes) and hardware-related
>> issues (faulty disks, faulty memory).
>> In the end, one option might be to replace the faulty machine with a
>> brand new one.
>> In cloud environments this is actually quite cheap and easy to do.
>>
>> Cheers
>> Gianluca
>>
>> On Wed, 29 May 2024 at 08:43, Naveen Kumar 
>> wrote:
>>
>>> Hello All
>>>
>>> We are using Ignite 2.13.0
>>>
>>> After a cluster restart, one of the nodes is not coming up and in the node
>>> logs we are seeing this error - Nod

Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up

2024-05-29 Thread Naveen Kumar
Thanks very much for your prompt response Gianluca

just for the community, I could solve this by running control.sh with
reset lost partitions (reset_lost_partitions) for the individual caches.
It looks like it worked, the partition issue is resolved. I suppose there
wouldn't be any data loss as we have set all our caches with 2 replicas.

coming to the node which was not getting added to the cluster earlier:
removed it from the baseline --> cleared its persistence store --> brought up
the node --> added it back to the baseline; this also seems to have worked
fine.
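
(For reference only: the baseline changes described above can also be
scripted. A rough sketch follows; "failed-node-id" and the client
configuration path are placeholders.)

    import java.util.ArrayList;
    import java.util.Collection;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.BaselineNode;

    public class BaselineAdjust {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                // Step 1: drop the failed node from the current baseline.
                Collection<BaselineNode> baseline =
                    new ArrayList<>(ignite.cluster().currentBaselineTopology());
                baseline.removeIf(n ->
                    "failed-node-id".equals(String.valueOf(n.consistentId())));
                ignite.cluster().setBaselineTopology(baseline);

                // Step 2: once the replacement node has joined with a clean
                // persistence directory, reset the baseline to all live
                // server nodes.
                ignite.cluster().setBaselineTopology(
                    ignite.cluster().forServers().nodes());
            }
        }
    }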

Thanks


On Wed, May 29, 2024 at 5:13 PM Gianluca Bonetti 
wrote:

> Hello Naveen
>
> Apache Ignite 2.13 is more than 2 years old, 25 months old in fact.
> Three bugfix releases have been rolled out over time, up to the 2.16 release.
>
> It seems you are restarting your cluster on a regular basis, so you'd
> better upgrade to 2.16 as soon as possible.
> Otherwise it will also be very difficult for people on a community-based
> mailing list, on volunteer time, to work out a solution with a 2-year-old
> version running.
>
> Besides that, you are not providing very much information about your
> cluster setup.
> How many nodes, what infrastructure, how many caches, overall data size.
> One could only guess you have more than 1 node running, with at least 1
> cache, and non-empty dataset. :)
>
> This document from GridGain may be helpful; I don't see an equivalent for
> Ignite, but it may still be worth checking out.
>
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode
>
> On the other hand you should also check your failing node.
> If it is always the same node failing, then there should be some root
> cause apart from Ignite.
> Indeed, if the node configuration is the same across all nodes and just
> this one fails, you should also consider network issues (check
> connectivity and network latency between nodes) and hardware-related
> issues (faulty disks, faulty memory).
> In the end, one option might be to replace the faulty machine with a brand
> new one.
> In cloud environments this is actually quite cheap and easy to do.
>
> Cheers
> Gianluca
>
> On Wed, 29 May 2024 at 08:43, Naveen Kumar 
> wrote:
>
>> Hello All
>>
>> We are using Ignite 2.13.0
>>
>> After a cluster restart, one of the nodes is not coming up, and in the node
>> logs we are seeing this error - Node requires maintenance, non-empty set of
>> maintenance tasks is found - node is not coming up
>>
>> We are getting "timeout is reached before computation is completed" errors
>> on other nodes as well.
>>
>> I could see that we have the control.sh script to back up and clean up the
>> corrupted files, but when I run the command, it fails.
>>
>> I have removed the node from the baseline and tried to run it as well; it is
>> still failing.
>>
>> What could be the solution for this? The cluster is functioning, however
>> there are requests failing.
>>
>> Is there any way we can start the Ignite node in maintenance mode and try
>> running the clean corrupted commands?
>>
>> Thanks
>> Naveen
>>
>>
>>

-- 
Thanks & Regards,
Naveen Bandaru


Apache Ignite - Load balancer nodes standalone

2024-05-29 Thread Eduardo Paludo via user
I installed 4 Apache Ignite nodes on 4 virtual machines that are in a VMware
cluster. One question I have is how do I connect to the cluster? Do I connect
via the IP of any node? Or do I need to put nginx "in front" to do the load
balancing? I didn't find anything about this in the official documentation.


Eduardo Henrique Paludo
Network and Data Communications Analyst
(49) 3361-6661 - Extension: 8074
Head Office, Chapecó - SC
Visit our website: www.expressosaomiguel.com.br


Re: Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up

2024-05-29 Thread Gianluca Bonetti
Hello Naveen

Apache Ignite 2.13 is more than 2 years old, 25 months old in fact.
Three bugfix releases have been rolled out over time, up to the 2.16 release.

It seems you are restarting your cluster on a regular basis, so you'd
better upgrade to 2.16 as soon as possible.
Otherwise it will also be very difficult for people on a community-based
mailing list, on volunteer time, to work out a solution with a 2-year-old
version running.

Besides that, you are not providing very much information about your
cluster setup.
How many nodes, what infrastructure, how many caches, overall data size.
One could only guess you have more than 1 node running, with at least 1
cache, and non-empty dataset. :)

This document from GridGain may be helpful; I don't see an equivalent for
Ignite, but it may still be worth checking out.
https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/maintenance-mode

On the other hand you should also check your failing node.
If it is always the same node failing, then there should be some root cause
apart from Ignite.
Indeed, if the node configuration is the same across all nodes and just
this one fails, you should also consider network issues (check
connectivity and network latency between nodes) and hardware-related
issues (faulty disks, faulty memory).
In the end, one option might be to replace the faulty machine with a brand
new one.
In cloud environments this is actually quite cheap and easy to do.

Cheers
Gianluca

On Wed, 29 May 2024 at 08:43, Naveen Kumar  wrote:

> Hello All
>
> We are using Ignite 2.13.0
>
> After a cluster restart, one of the nodes is not coming up, and in the node
> logs we are seeing this error - Node requires maintenance, non-empty set of
> maintenance tasks is found - node is not coming up
>
> We are getting "timeout is reached before computation is completed" errors
> on other nodes as well.
>
> I could see that we have the control.sh script to back up and clean up the
> corrupted files, but when I run the command, it fails.
>
> I have removed the node from the baseline and tried to run it as well; it is
> still failing.
>
> What could be the solution for this? The cluster is functioning, however
> there are requests failing.
>
> Is there any way we can start the Ignite node in maintenance mode and try
> running the clean corrupted commands?
>
> Thanks
> Naveen
>
>
>


Node requires maintenance, non-empty set of maintenance tasks is found - node is not coming up

2024-05-29 Thread Naveen Kumar
Hello All

We are using Ignite 2.13.0

After a cluster restart, one of the nodes is not coming up, and in the node
logs we are seeing this error - Node requires maintenance, non-empty set of
maintenance tasks is found - node is not coming up

We are getting "timeout is reached before computation is completed" errors
on other nodes as well.

I could see that we have the control.sh script to back up and clean up the
corrupted files, but when I run the command, it fails.

I have removed the node from the baseline and tried to run it as well; it is
still failing.

What could be the solution for this? The cluster is functioning, however
there are requests failing.

Is there any way we can start the Ignite node in maintenance mode and try
running the clean corrupted commands?

Thanks
Naveen