You can read about baseline topology in the documentation [1]. Manual baseline management is available via the control script [2].
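For reference, the manual baseline management mentioned above looks roughly like this with the control script (a sketch; run from the Ignite `bin` directory, and the consistent ID is illustrative):

```shell
# Print current cluster state and baseline topology
./control.sh --baseline

# Remove an offline node from the baseline by its consistent ID
# (this triggers rebalancing of its partitions to the remaining nodes)
./control.sh --baseline remove f151bd52-c173-45d7-952d-45cbe1d5fe97 --yes

# Add a node back to the baseline after it rejoins the cluster
./control.sh --baseline add f151bd52-c173-45d7-952d-45cbe1d5fe97 --yes
```

Removing a node from the baseline is what causes the remaining nodes to restore the configured number of backup copies.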
Links:
1. https://ignite.apache.org/docs/latest/clustering/baseline-topology
2. https://ignite.apache.org/docs/latest/tools/control-script#activation-deactivation-and-topology-management

On Tue, 22 Nov 2022 at 21:58, Ilya Shishkov <shishkovi...@gmail.com> wrote:

> There is a typo here:
>
> Lost partitions are expected behaviour in case of partition because you
> have only 1 backup and lost two nodes.
>
> I meant that lost partitions are expected behaviour for partitioned
> caches when the number of offline nodes is greater than the number of
> backups. In your case there is 1 backup and 2 offline nodes.
>
> On Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <shishkovi...@gmail.com> wrote:
>
>> Hi,
>> > 1) What can I do to recover from the lost partitions problem after
>> shutting down several nodes?
>> > I thought that in the case of a graceful shutdown this problem would
>> be solved.
>> > Now I can recover by returning *one* of the offline nodes to the
>> cluster (starting the service) and running the *reset_lost_partitions*
>> command for the broken cache. After this, the cache becomes available.
>>
>> Are the caches with lost partitions replicated or partitioned? Lost
>> partitions are expected behaviour in case of partition because you have
>> only 1 backup and lost two nodes. If you want cluster data to remain
>> fully available with 2 nodes offline, you should set 2 backups for
>> partitioned caches.
>>
>> As for graceful shutdown: why do you expect that data would not be lost?
>> If you have 1 backup and 1 offline node, then there are some partitions
>> without backups, because the latter remain inaccessible while their owner
>> is offline. So, if you shut down another node holding such partitions,
>> they will be lost.
>>
>> So, for persistent clusters: if you are in a situation where you must
>> work for a long time without backups (i.e. with offline nodes, BUT
>> without partition loss), you should trigger a rebalance. It can be done
>> manually or automatically by changing the baseline.
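The "2 backups" advice above can be expressed in the cache configuration, e.g. in Spring XML (a sketch; the bean is shown in isolation and the cache name is illustrative):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="myPartitionedCache"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <!-- With 2 backups the cache tolerates 2 node failures without data loss -->
    <property name="backups" value="2"/>
    <!-- Fail reads and writes to lost partitions instead of serving stale data -->
    <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
</bean>
```

With `READ_WRITE_SAFE`, operations on lost partitions raise an exception until `reset_lost_partitions` is run, which is the behaviour described later in this thread.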
>> After rebalancing, the number of data copies will be restored.
>>
>> Now you should bring back at least one of the nodes in order to make the
>> partitions available. But if you need a full set of primary and backup
>> partitions, you need all baseline nodes in the cluster.
>>
>> 2) What can I do to prevent this problem in a scenario with automatic
>> cluster deployment? Should I add the *reset_lost_partitions* command
>> after activation or redeploy?
>>
>> I don't fully understand what you mean, but there are no problems with
>> automatic deployments as such. In most cases, a situation with partition
>> losses means that the cluster is in an invalid state.
>>
>> On Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <
>> roza.ays...@sbermarket.ru> wrote:
>>
>>> Hi Sumit!
>>>
>>> Thanks for your reply!
>>>
>>> Yes, I have used the reset_lost_partitions utility many times.
>>>
>>> The problem is that this command requires all baseline nodes to be
>>> present.
>>> If I shut down a node, auto adjustment does not remove it from the
>>> baseline topology, and reset_lost_partitions fails with an error saying
>>> that all partition owners have left the grid and partition data has been
>>> lost.
>>>
>>> So I remove them manually, and this operation succeeds, but with loss
>>> of the data on the offline nodes.
>>>
>>> What I am trying to understand is why graceful shutdown does not handle
>>> this situation in the case of backup caches and persistence.
>>> How can we automatically bring up Ignite nodes if, after a redeploy,
>>> data is lost because the cluster can't handle the lost partitions
>>> problem?
>>>
>>> Best regards,
>>> Rose.
>>>
>>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <sumit.deshi...@gmail.com>
>>> wrote:
>>>
>>> Please check if this helps:
>>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>>> Also, is there any reason baseline auto adjustment is disabled?
>>>
>>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
>>> roza.ays...@sbermarket.ru> wrote:
>>>
>>>> Hola again!
>>>>
>>>> I discovered that enabling graceful shutdown does not work.
>>>>
>>>> In the service logs I see that nothing happens when *SIGTERM* comes :(
>>>> Eventually the stop action timed out and *SIGKILL* was sent, which
>>>> caused an ungraceful shutdown.
>>>> The timeout is set to *10 minutes*.
>>>>
>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>>>> In-Memory Computing Platform Service...
>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>>>> In-Memory Computing Platform Service.
>>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>>>> In-Memory Computing Platform Service...
>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: State 'stop-final-sigterm' timed
>>>> out. Killing.
>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: Killing process 11135 (java) with
>>>> signal SIGKILL.
>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: Failed with result 'timeout'.
>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>>>> In-Memory Computing Platform Service.
>>>>
>>>>
>>>> I also enabled *DEBUG* level and see that nothing happens after
>>>> rebalancing has started (this is the end of the log):
>>>>
>>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>>>> hook...
>>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>>>> progress (ignoring): Shutdown in progress
>>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>>>> have sufficient backups and local rebalance completion...
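One contributing factor in the logs above is that systemd hard-kills the JVM when its stop timeout expires, while waiting for backups on shutdown can legitimately take a long time. The stop timeout can be raised in a unit override (a sketch; the override path and unit name are illustrative, based on the truncated unit name in the logs):

```ini
# e.g. /etc/systemd/system/apache-ignite@.service.d/override.conf
[Service]
# Give the node more time to finish rebalancing before systemd sends SIGKILL
TimeoutStopSec=30min
```

After creating the override, `systemctl daemon-reload` is needed for it to take effect. This does not explain why the shutdown hook appears to hang, but it rules out premature SIGKILL as a cause of ungraceful stops.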
>>>>
>>>>
>>>> I forgot to add that the service is started with *service.sh*, not
>>>> *ignite.sh*.
>>>>
>>>> Please help!
>>>>
>>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>>>> roza.ays...@sbermarket.ru> wrote:
>>>>
>>>> Hola!
>>>> I have a problem recovering from a cluster crash when persistence is
>>>> enabled.
>>>>
>>>> Our setup is:
>>>> - 5 VM nodes with 40 GB RAM and a 200 GB disk,
>>>> - persistence is enabled (on a separate disk on each VM),
>>>> - all cluster actions are performed through Ansible playbooks,
>>>> - all caches are either partitioned with backups = 1 or replicated,
>>>> - the cluster starts as a service running ignite.sh,
>>>> - baseline auto adjust is disabled.
>>>>
>>>> Also, following the docs about partition loss policy, I have added
>>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait
>>>> for partition rebalancing on shutdown.
>>>>
>>>> The problem we have: after shutting down several nodes (2 of 5) one
>>>> after another, an exception about lost partitions is raised:
>>>>
>>>> *Caused by:
>>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>>> Failed to execute query because cache partition has been lostPart
>>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>>
>>>> But in the logs of the dead nodes I see that all shutdown hooks are
>>>> called as expected on both nodes:
>>>>
>>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>>>> hook...
>>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>>>> have sufficient backups and local rebalance completion...
>>>>
>>>>
>>>> And the baseline topology looks like this (with 2 offline nodes, as
>>>> expected):
>>>>
>>>> Cluster state: active
>>>> Current topology version: 23
>>>> Baseline auto adjustment disabled: softTimeout=30000
>>>>
>>>> Current topology version: 23 (Coordinator:
>>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>>>> Order=3)
>>>>
>>>> Baseline nodes:
>>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>>>> Address=172.17.0.1, State=ONLINE, Order=3
>>>> ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>>>> Address=172.17.0.1, State=ONLINE, Order=21
>>>> ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>>>> Address=172.17.0.1, State=ONLINE, Order=5
>>>> ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>> ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>>>
>>>> --------------------------------------------------------------------------------
>>>> Number of baseline nodes: 5
>>>>
>>>> Other nodes not found.
>>>>
>>>>
>>>> So my questions are:
>>>>
>>>> 1) What can I do to recover from the lost partitions problem after
>>>> shutting down several nodes? I thought that in the case of a graceful
>>>> shutdown this problem would be solved.
>>>>
>>>> Now I can recover by returning *one* of the offline nodes to the
>>>> cluster (starting the service) and running the *reset_lost_partitions*
>>>> command for the broken cache. After this, the cache becomes available.
>>>>
>>>> 2) What can I do to prevent this problem in a scenario with automatic
>>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>>> after activation or redeploy?
>>>>
>>>> Please help.
>>>> Thanks in advance!
>>>>
>>>> Best regards,
>>>> Rose.
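The recovery path described in question 1 corresponds roughly to this control script invocation, run after at least one former partition owner is back online (a sketch; the cache name is the one from the exception above):

```shell
# Declare the lost partitions of the listed cache(s) usable again,
# accepting that data from still-offline owners may be missing
./control.sh --cache reset_lost_partitions PUBLIC_StoreProductFeatures
```

The command takes a comma-separated list of cache names if several caches have lost partitions.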
>>>> --
>>>> *Роза Айсина*
>>>> Senior Software Developer
>>>> *СберМаркет* | Delivery from your favorite stores
>>>> Email: roza.ays...@sbermarket.ru
>>>> Web: sbermarket.ru
>>>>
>>>> *CONFIDENTIALITY NOTICE:* This email and any files attached to it are
>>>> confidential. If you are not the intended recipient you are notified
>>>> that using, copying, distributing or taking any action in reliance on
>>>> the contents of this information is strictly prohibited. If you have
>>>> received this email in error please notify the sender and delete this
>>>> email.
>>>
>>> --
>>> Regards,
>>> Sumit Deshinge
>>>
>>