Client nodes disconnecting is not the problem here. You have server nodes going down.
Caches are split into partitions, which are distributed across the nodes in your cluster. If one of your data nodes goes down and you have not configured any backup partitions, you will lose some partitions and the data in them. There is a command in the control script to "reset lost partitions": https://ignite.apache.org/docs/2.11.1/tools/control-script#resetting-lost-partitions

Of course this does not magically bring the data back. You should consider adding more nodes and configuring your caches with at least one backup. Sketches of both follow after the quoted message below.

> On 18 Jul 2022, at 12:49, Айсина Роза <[email protected]> wrote:
>
> Hello!
>
> We have an Ignite standalone cluster in a k8s environment with 2 server nodes and
> several clients - a Java Spring application and a Spark application.
> Both apps start client nodes to connect to the cluster every two hours (a rolling
> update redeploy of both apps happens).
> The whole setup is in k8s in one namespace.
>
> There is strange behavior we see sporadically after several weeks.
> The cache both apps use often becomes corrupted with the following exception:
>
> [10:57:43,951][SEVERE][client-connector-#2796][ClientListenerNioListener]
> Failed to process client request
> [req=o.a.i.i.processors.platform.client.cache.ClientCacheScanQueryRequest@78481268,
> msg=class o.a.i.i.processors.cache.CacheInvalidStateException: Failed to
> execute query because cache partition has been lostParts
> [cacheName=PipelineConfig, part=0]]
> javax.cache.CacheException: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute query because cache partition has been lostParts
> [cacheName=PipelineConfig, part=0]
>
> I looked through the server logs from both Ignite nodes and found some
> events that I cannot understand.
> I attached logs - one filtered with the keyword "Exception" to locate errors, and the
> other with the original logs from when the first lost-partitions error happens.
>
> It seems that this error is causing the behavior: "Failed to shutdown socket".
> After this, all interaction with the cluster becomes impossible.
> There are also many errors like this: "Client disconnected abruptly due to
> network connection loss or because the connection was left open on
> application shutdown."
>
> So I have two questions:
> Can you please help investigate the main reason for the lost-partitions error
> and how to handle it automatically? Right now I manually redeploy the whole
> cluster and then all applications connected to it, which is insane and very
> human-dependent.
> Are there any special actions I need to take to gracefully shut down client nodes
> when the apps shut down? Is it possible that frequent (every 2h)
> connect-then-die events from client nodes cause this behavior?
>
> Thank you in advance! Looking forward to any help! 🙏
>
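For reference, per the linked docs page, resetting the lost state of the cache from your logs looks roughly like this (run from any server node; adjust host/port options for your k8s setup):

    ./control.sh --cache reset_lost_partitions PipelineConfig

Again, this only clears the "lost" flag so the cache becomes usable; whatever data was in the lost partitions is gone unless it can be reloaded from elsewhere.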
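Here is a minimal sketch of what "at least one backup" means on the cache configuration, assuming a Java-configured server node and the cache name from your logs (if you configure the nodes via Spring XML, the same properties apply on the CacheConfiguration bean; the partition loss policy line is optional and shown only for illustration):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.cache.PartitionLossPolicy;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerWithBackups {
        public static void main(String[] args) {
            // One backup copy per partition, so a single server node can go
            // down without any partitions (and their data) being lost.
            CacheConfiguration<String, Object> cacheCfg =
                new CacheConfiguration<String, Object>("PipelineConfig")
                    .setCacheMode(CacheMode.PARTITIONED)
                    .setBackups(1)
                    // Fail reads/writes to lost partitions rather than
                    // silently serving incomplete data.
                    .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setCacheConfiguration(cacheCfg);

            Ignite ignite = Ignition.start(cfg);
        }
    }

With two server nodes and one backup, losing either node leaves a full copy of every partition on the surviving node, so you should no longer see the "partition has been lost" exception on a single node failure.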

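On the second question: abrupt client disconnects should not by themselves cause partition loss (only server nodes hold data), but closing client nodes cleanly on redeploy avoids the noisy "Client disconnected abruptly" messages. A minimal sketch, assuming a thick client node started from Java; a JVM shutdown hook is one way to make sure close() runs when the pod is terminated:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class GracefulClient {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                .setClientMode(true); // join the cluster as a client node, not a data node

            Ignite ignite = Ignition.start(cfg);

            // Close the client node when the JVM shuts down (e.g. on pod
            // termination), so the servers see a clean leave instead of a
            // dropped socket.
            Runtime.getRuntime().addShutdownHook(new Thread(ignite::close));

            // ... application work with ignite.cache("PipelineConfig") ...
        }
    }

If the Spring and Spark apps use the thin client (the ClientListenerNioListener lines in your logs suggest at least some thin-client traffic), the same idea applies: IgniteClient is AutoCloseable, so close it on application shutdown.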