Hello Роза,

You're right, having persistence enabled should prevent the cluster from losing partitions, provided that all nodes are online, of course. So if a node (or the whole cluster) goes down, the partitions should not be lost after the restart.

I have a couple of questions:

1. Do I understand correctly that you observe server nodes go down and k8s recreate them?
2. Can you provide your cluster configuration?
3. Can you check that the nodes that are started are the same nodes that went down? A (re-)started node should have the same consistent id as the node that went down. If it doesn't, then it's a brand new node with no persistence.
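For reference, a minimal sketch of what such a setup could look like with programmatic configuration (Java; the consistent id value and data-region defaults are placeholders, not your actual settings) - a fixed consistent id, native persistence on the default data region, and one backup for the PipelineConfig cache:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Fixed consistent id: a recreated pod that reuses this id (and the same
        // persistence directory) picks up its old partitions instead of joining
        // as a brand new node. "server-node-1" is just a placeholder value.
        cfg.setConsistentId("server-node-1");

        // Enable native persistence on the default data region so data survives restarts.
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storageCfg);

        // One backup copy per partition, so losing a single node does not lose data.
        CacheConfiguration<String, Object> cacheCfg = new CacheConfiguration<>("PipelineConfig");
        cacheCfg.setBackups(1);
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);

        // With persistence enabled the cluster starts inactive and must be activated
        // (or have baseline auto-adjust configured) before caches can be used.
        ignite.cluster().state(ClusterState.ACTIVE);
    }
}

With fixed consistent ids and persisted data on both nodes, a full cluster restart should bring the same partitions back; the backup additionally covers the case where only one of the two nodes is lost.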
Regards,
Semyon.

> Hi Stephen!
>
> Thanks for your reply!
>
> 2. Well, that's the problem - I can't figure out why all server nodes go down. Nobody uses this cluster except my two apps with client nodes. And nothing happens before the unexpected shutdown and recreation of the server pods. The k8s cluster seems fine as well.
>
> 3. Also, I have persistence enabled (with data saved on disk on a single k8s node). Why can't the server pods restore their caches from persistence automatically when they are recreated? I thought this is the main goal of persistence - to save data.
>
> 4. Unfortunately, resetting partitions didn't help :( The control script returned a 0 exit code, but it was still impossible to retrieve data from the corrupted cache (same error). So I deleted the cache data, redeployed the whole Ignite cluster, and now everything works fine.
> But it's very costly to do this every time the Ignite server nodes are recreated, which shouldn't be a "stop-the-world" problem since the data is saved.
>
> 5. I guess that backing up partitions will not help, as both nodes went down at the same time. It seems to me that all partitions would then be lost, including those that were backed up.
>
> Best regards,
>
> Rose.
>
> From: Stephen Darlington <[email protected]>
> Sent: Monday, July 18, 2022 5:54 PM
> To: user
> Subject: Re: How to fix lost partitions gracefully?
>
> Client nodes disconnecting is not the problem here. You have server nodes going down.
>
> Caches are split into partitions, which are then distributed across the nodes in your cluster. If one of your data nodes goes down, and you have not configured any backup partitions, then you will lose some partitions and the data in them.
>
> There’s a script you can run to “reset lost partitions”: control-script
>
> Of course this does not magically bring the data back.
>
> You perhaps need to consider more nodes and configure your caches with at least one backup.
>
>> On 18 Jul 2022, at 12:49, Айсина Роза <[email protected]> wrote:
>>
>> Hello!
>>
>> We have an Ignite standalone cluster in a k8s environment with 2 server nodes and several clients - a Java Spring application and a Spark application.
>>
>> Both apps start client nodes to connect to the cluster every two hours (a rolling-update redeploy of both apps happens).
>>
>> The whole setup is in k8s in one namespace.
>>
>> There is strange behavior we see sporadically after several weeks.
>>
>> The cache both apps use often becomes corrupted with the following exception:
>>
>> [10:57:43,951][SEVERE][client-connector-#2796][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheScanQueryRequest@78481268, msg=class o.a.i.i.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=PipelineConfig, part=0]]
>>
>> javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostParts [cacheName=PipelineConfig, part=0]
>>
>> I looked through the server logs from both Ignite nodes and found some events that I cannot understand.
>>
>> I attached logs - one filtered by the keyword "Exception" to locate errors, and the other with the original logs from when the first lost partitions error happens.
>>
>> It seems that this error is causing this behavior: Failed to shutdown socket
>>
>> After this, all interactions with the cluster become impossible.
>>
>> Also, there are many errors like this: Client disconnected abruptly due to network connection loss or because the connection was left open on application shutdown.
>>
>> So I have two questions:
>>
>> 2. Can you please help investigate the main reason for the lost partitions error and how to handle it automatically? Right now I manually redeploy the whole cluster and then all applications connected to it, which is insane and very human-dependent.
>> 3. Are there any special actions I need to take to gracefully handle client nodes when the apps are going to shut down? Is it possible that frequent (every 2h) connect-then-die events from client nodes cause this behavior?
>>
>> Thank you in advance! Looking forward to any help! 🙏
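On question 3 in the quoted message above: stopping the client node explicitly when the application shuts down lets it leave the topology cleanly instead of being dropped on connection loss, which is what produces the "Client disconnected abruptly" messages (Stephen notes the disconnects themselves are not the cause of the lost partitions). A minimal sketch, assuming the apps start thick client nodes via Ignition; class and id names are illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientNodeLifecycle {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true);   // join the cluster as a (thick) client node

        // try-with-resources stops the node on exit, so it leaves the topology
        // cleanly instead of being dropped later on network connection loss.
        try (Ignite client = Ignition.start(cfg)) {
            // ... application work against the cluster ...
        }

        // Alternatively, register a JVM shutdown hook or a Spring @PreDestroy
        // callback that calls Ignition.stop(false) before the pod terminates.
    }
}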
