Hi Роза,

In addition to my previous answer:
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.

Your nodes were killed (SIGKILL), so there was no graceful shutdown. As I
said earlier, you should trigger rebalancing (i.e. remove the stopping nodes
from the baseline) and wait for it to complete. After rebalancing, the nodes
removed from the baseline will shut down gracefully. You can read more about
this feature in [1].

1. https://ignite.apache.org/docs/latest/starting-nodes#shutting-down-nodes
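For example, with the control script this could look as follows (a minimal
sketch: the consistent IDs are placeholders, IGNITE_HOME is assumed to point
at the installation, and depending on the Ignite version a node may have to
be offline before it can be removed from the baseline):

    # 1. Look up the consistent IDs of the nodes you are going to stop:
    $IGNITE_HOME/bin/control.sh --baseline

    # 2. Remove them from the baseline; this triggers rebalancing of their
    #    data to the remaining nodes (--yes skips the confirmation prompt):
    $IGNITE_HOME/bin/control.sh --baseline remove <consistentId1>,<consistentId2> --yes

    # 3. Wait until rebalancing finishes (the nodes log rebalancing progress),
    #    then stop the services: the shutdown should now complete gracefully.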
Tue, 22 Nov 2022 at 22:11, Ilya Shishkov <shishkovi...@gmail.com>:

> You can read about baseline topology in the documentation [1]. Manual
> baseline management can be done by means of the control script [2].
>
> Links:
> 1. https://ignite.apache.org/docs/latest/clustering/baseline-topology
> 2. https://ignite.apache.org/docs/latest/tools/control-script#activation-deactivation-and-topology-management
>
> Tue, 22 Nov 2022 at 21:58, Ilya Shishkov <shishkovi...@gmail.com>:
>
>> There is a typo here:
>> > Lost partitions are expected behaviour in case of partition because you
>> > have only 1 backup and lost two nodes.
>>
>> I mean that lost partitions are expected behaviour for partitioned caches
>> when the number of offline nodes is greater than the number of backups.
>> In your case there is 1 backup and there are 2 offline nodes.
>>
>> Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <shishkovi...@gmail.com>:
>>
>>> Hi,
>>> > 1) What can I do to recover from the lost partitions problem after
>>> > shutting down several nodes? I thought that in case of graceful
>>> > shutdown this problem must be solved.
>>> > Now I can recover by returning *one* of the offline nodes to the
>>> > cluster (starting the service) and running the *reset_lost_partitions*
>>> > command for the broken cache. After this the cache becomes available.
>>>
>>> Are the caches with lost partitions replicated or partitioned? Lost
>>> partitions are expected behaviour in case of partition because you have
>>> only 1 backup and lost two nodes. If you want the cluster data to remain
>>> fully available with 2 nodes offline, you should set 2 backups for
>>> partitioned caches.
>>>
>>> As for graceful shutdown: why do you expect that data would not be lost?
>>> If you have 1 backup and 1 offline node, then some partitions are left
>>> without backups, because the backup copies remain inaccessible while
>>> their owner is offline. So, if you shut down another node holding such
>>> partitions, they will be lost.
>>>
>>> So, for persistent clusters: if you are in a situation where you have to
>>> work for a long time without backups (i.e. with offline nodes, BUT
>>> without partition loss), you should trigger rebalancing. It can be done
>>> manually, or automatically by changing the baseline. After rebalancing,
>>> the number of data copies will be restored.
>>>
>>> For now you should bring back at least one of the nodes in order to make
>>> the partitions available. But if you need a full set of primary and
>>> backup partitions, you need all baseline nodes in the cluster.
>>>
>>> > 2) What can I do to prevent this problem in a scenario with automatic
>>> > cluster deployment? Should I add the *reset_lost_partitions* command
>>> > after activation or redeploy?
>>>
>>> I don't fully understand what you mean, but there are no problems with
>>> automatic deployments as such. In most cases, partition loss indicates
>>> that the cluster is in an invalid state.
>>>
>>> Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна
>>> <roza.ays...@sbermarket.ru>:
>>>
>>>> Hi Sumit!
>>>>
>>>> Thanks for your reply!
>>>>
>>>> Yeah, I have used the reset_lost_partitions utility many times.
>>>>
>>>> The problem is that this function requires all baseline nodes to be
>>>> present. If I shut down a node, auto adjustment does not remove it from
>>>> the baseline topology, and reset_lost_partitions fails with an error
>>>> saying that all partition owners have left the grid and partition data
>>>> has been lost.
>>>>
>>>> So I remove them manually, and this operation succeeds, but with loss
>>>> of the data on the offline nodes.
>>>>
>>>> What I am trying to understand is why graceful shutdown does not handle
>>>> this situation when caches have backups and persistence is enabled.
>>>> How can we bring Ignite nodes up automatically if data is lost after a
>>>> redeploy because the cluster cannot handle the lost partitions problem?
>>>>
>>>> Best regards,
>>>> Rose.
>>>>
>>>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <sumit.deshi...@gmail.com>
>>>> wrote:
>>>>
>>>> Please check if this helps:
>>>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>>>> Also, is there any reason baseline auto adjustment is disabled?
>>>>
>>>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна
>>>> <roza.ays...@sbermarket.ru> wrote:
>>>>
>>>>> Hola again!
>>>>>
>>>>> I discovered that enabling graceful shutdown via
>>>>> IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN does not work.
>>>>>
>>>>> In the service logs I see that nothing happens when *SIGTERM* arrives :(
>>>>> Eventually the stop action times out and *SIGKILL* is sent, which
>>>>> causes an ungraceful shutdown. The timeout is set to *10 minutes*.
>>>>>
>>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
>>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
>>>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
>>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
>>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
>>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.
>>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.
>>>>>
>>>>> I also enabled the *DEBUG* level and see that nothing happens after
>>>>> rebalancing starts (this is the end of the log):
>>>>>
>>>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
>>>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
>>>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>>>>>
>>>>> I forgot to add that the service is started with *service.sh*, not
>>>>> *ignite.sh*.
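>>>>> (Two things that might be worth checking here; a sketch assuming a
>>>>> standard Linux setup, where <ignite-unit> stands for the real unit
>>>>> name, shown truncated in the logs as apache-ign...@config.xml.service:
>>>>>
>>>>>     # 1. Did -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN actually reach the
>>>>>     #    java command line? service.sh builds the command line itself,
>>>>>     #    so the option could be lost on the way:
>>>>>     ps -ef | grep -o 'IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=[a-zA-Z]*'
>>>>>
>>>>>     # 2. What stop timeout does systemd apply to the unit? Raise it
>>>>>     #    with a drop-in if graceful shutdown legitimately needs longer
>>>>>     #    than 10 minutes:
>>>>>     systemctl show -p TimeoutStopUSec '<ignite-unit>'
>>>>>     sudo systemctl edit '<ignite-unit>'
>>>>>     # ...then add:
>>>>>     #   [Service]
>>>>>     #   TimeoutStopSec=30min
>>>>>
>>>>> A longer timeout alone will not help, though, if the shutdown hook is
>>>>> waiting for backups that cannot be rebalanced while the stopping nodes
>>>>> are still in the baseline, as explained at the top of this thread.)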
>>>>> Please help!
>>>>>
>>>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна
>>>>> <roza.ays...@sbermarket.ru> wrote:
>>>>>
>>>>> Hola!
>>>>>
>>>>> I have a problem recovering from a cluster crash when persistence is
>>>>> enabled.
>>>>>
>>>>> Our setup is:
>>>>> - 5 VM nodes with 40 GB RAM and 200 GB disk,
>>>>> - persistence is enabled (on a separate disk on each VM),
>>>>> - all cluster actions are made through Ansible playbooks,
>>>>> - all caches are either partitioned with backups = 1 or replicated,
>>>>> - the cluster starts as a service by running ignite.sh,
>>>>> - baseline auto adjust is disabled.
>>>>>
>>>>> Also, following the docs about partition loss policy, I have added
>>>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
>>>>> partition rebalancing on shutdown.
>>>>>
>>>>> The problem we have: after shutting down several nodes (2 of 5) one
>>>>> after another, an exception about lost partitions is raised:
>>>>>
>>>>> *Caused by:
>>>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>>>> Failed to execute query because cache partition has been lostPart
>>>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>>>
>>>>> But in the logs of the dead nodes I see that all shutdown hooks are
>>>>> called as expected on both nodes:
>>>>>
>>>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
>>>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>>>>>
>>>>> And the baseline topology looks like this (with 2 offline nodes, as
>>>>> expected):
>>>>>
>>>>> Cluster state: active
>>>>> Current topology version: 23
>>>>> Baseline auto adjustment disabled: softTimeout=30000
>>>>>
>>>>> Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)
>>>>>
>>>>> Baseline nodes:
>>>>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
>>>>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
>>>>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
>>>>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>>>> --------------------------------------------------------------------------------
>>>>> Number of baseline nodes: 5
>>>>>
>>>>> Other nodes not found.
>>>>>
>>>>> So my questions are:
>>>>>
>>>>> 1) What can I do to recover from the lost partitions problem after
>>>>> shutting down several nodes? I thought that in case of graceful
>>>>> shutdown this problem must be solved.
>>>>>
>>>>> Now I can recover by returning *one* of the offline nodes to the
>>>>> cluster (starting the service) and running the *reset_lost_partitions*
>>>>> command for the broken cache. After this the cache becomes available.
>>>>>
>>>>> 2) What can I do to prevent this problem in a scenario with automatic
>>>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>>>> after activation or redeploy?
>>>>>
>>>>> Please help.
>>>>> Thanks in advance!
>>>>>
>>>>> Best regards,
>>>>> Rose.
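(For completeness: the recovery sequence described in question 1 above could
presumably be scripted along these lines. This is a sketch, not a verified
procedure: <ignite-unit> is a placeholder for the real systemd unit name, and
the cache name is taken from the exception above.

    # On one of the offline nodes, start the Ignite service again:
    sudo systemctl start '<ignite-unit>'

    # Wait until the node shows up as ONLINE in the baseline:
    $IGNITE_HOME/bin/control.sh --baseline

    # Then reset the lost partitions of the affected cache:
    $IGNITE_HOME/bin/control.sh --cache reset_lost_partitions PUBLIC_StoreProductFeatures

As noted in the answers above, this restores cache availability but not the
data copies that were on the offline nodes; a full set of primary and backup
partitions requires all baseline nodes to be back in the cluster.)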