Hi Роза,

In addition to my previous answer:
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.

Your nodes were killed (SIGKILL), so there was no graceful shutdown. As I
said earlier, you should trigger rebalancing (i.e. remove the stopping nodes
from the baseline) and wait for it to complete. After rebalancing, the nodes
removed from the baseline will shut down gracefully. You can read more about
this feature in [1].

1. https://ignite.apache.org/docs/latest/starting-nodes#shutting-down-nodes
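For example, with the control script this could look as follows (a minimal
sketch: the consistent IDs are placeholders, IGNITE_HOME is assumed to point
at the installation, and depending on the Ignite version a node may have to
be offline before it can be removed from the baseline):

    # 1. Look up the consistent IDs of the nodes you are going to stop:
    $IGNITE_HOME/bin/control.sh --baseline

    # 2. Remove them from the baseline; this triggers rebalancing of their
    #    data to the remaining nodes (--yes skips the confirmation prompt):
    $IGNITE_HOME/bin/control.sh --baseline remove <consistentId1>,<consistentId2> --yes

    # 3. Wait until rebalancing finishes (the nodes log rebalancing progress),
    #    then stop the services: the shutdown should now complete gracefully.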
Tue, 22 Nov 2022 at 22:11, Ilya Shishkov <shishkovi...@gmail.com>:

> You can read about baseline topology in the documentation [1]. Manual
> baseline management can be done by means of the control script [2].
>
> Links:
> 1. https://ignite.apache.org/docs/latest/clustering/baseline-topology
> 2. https://ignite.apache.org/docs/latest/tools/control-script#activation-deactivation-and-topology-management
>
> Tue, 22 Nov 2022 at 21:58, Ilya Shishkov <shishkovi...@gmail.com>:
>
>> There is a typo here:
>> > Lost partitions are expected behaviour in case of partition because you
>> > have only 1 backup and lost two nodes.
>>
>> I mean that lost partitions are expected behaviour for partitioned caches
>> when the number of offline nodes is greater than the number of backups.
>> In your case there is 1 backup and there are 2 offline nodes.
>>
>> Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <shishkovi...@gmail.com>:
>>
>>> Hi,
>>> > 1) What can I do to recover from the lost partitions problem after
>>> > shutting down several nodes? I thought that in case of graceful
>>> > shutdown this problem must be solved.
>>> > Now I can recover by returning *one* of the offline nodes to the
>>> > cluster (starting the service) and running the *reset_lost_partitions*
>>> > command for the broken cache. After this the cache becomes available.
>>>
>>> Are the caches with lost partitions replicated or partitioned? Lost
>>> partitions are expected behaviour in case of partition because you have
>>> only 1 backup and lost two nodes. If you want the cluster data to remain
>>> fully available with 2 nodes offline, you should set 2 backups for
>>> partitioned caches.
>>>
>>> As for graceful shutdown: why do you expect that data would not be lost?
>>> If you have 1 backup and 1 offline node, then some partitions are left
>>> without backups, because the backup copies remain inaccessible while
>>> their owner is offline. So, if you shut down another node holding such
>>> partitions, they will be lost.
>>>
>>> So, for persistent clusters: if you are in a situation where you have to
>>> work for a long time without backups (i.e. with offline nodes, BUT
>>> without partition loss), you should trigger rebalancing. It can be done
>>> manually, or automatically by changing the baseline. After rebalancing,
>>> the number of data copies will be restored.
>>>
>>> For now you should bring back at least one of the nodes in order to make
>>> the partitions available. But if you need a full set of primary and
>>> backup partitions, you need all baseline nodes in the cluster.
>>>
>>> > 2) What can I do to prevent this problem in a scenario with automatic
>>> > cluster deployment? Should I add the *reset_lost_partitions* command
>>> > after activation or redeploy?
>>>
>>> I don't fully understand what you mean, but there are no problems with
>>> automatic deployments as such. In most cases, partition loss indicates
>>> that the cluster is in an invalid state.
>>>
>>> Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна
>>> <roza.ays...@sbermarket.ru>:
>>>
>>>> Hi Sumit!
>>>>
>>>> Thanks for your reply!
>>>>
>>>> Yeah, I have used the reset_lost_partitions utility many times.
>>>>
>>>> The problem is that this function requires all baseline nodes to be
>>>> present. If I shut down a node, auto adjustment does not remove it from
>>>> the baseline topology, and reset_lost_partitions fails with an error
>>>> saying that all partition owners have left the grid and partition data
>>>> has been lost.
>>>>
>>>> So I remove them manually, and this operation succeeds, but with loss
>>>> of the data on the offline nodes.
>>>>
>>>> What I am trying to understand is why graceful shutdown does not handle
>>>> this situation when caches have backups and persistence is enabled.
>>>> How can we bring Ignite nodes up automatically if data is lost after a
>>>> redeploy because the cluster cannot handle the lost partitions problem?
>>>>
>>>> Best regards,
>>>> Rose.
>>>>
>>>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <sumit.deshi...@gmail.com>
>>>> wrote:
>>>>
>>>> Please check if this helps:
>>>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>>>> Also, is there any reason baseline auto adjustment is disabled?
>>>>
>>>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна
>>>> <roza.ays...@sbermarket.ru> wrote:
>>>>
>>>>> Hola again!
>>>>>
>>>>> I discovered that enabling graceful shutdown via
>>>>> IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN does not work.
>>>>>
>>>>> In the service logs I see that nothing happens when *SIGTERM* arrives :(
>>>>> Eventually the stop action times out and *SIGKILL* is sent, which
>>>>> causes an ungraceful shutdown. The timeout is set to *10 minutes*.
>>>>>
>>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
>>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
>>>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
>>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
>>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
>>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.
>>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.
>>>>>
>>>>> I also enabled the *DEBUG* level and see that nothing happens after
>>>>> rebalancing starts (this is the end of the log):
>>>>>
>>>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
>>>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
>>>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>>>>>
>>>>> I forgot to add that the service is started with *service.sh*, not
>>>>> *ignite.sh*.
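>>>>> (Two things that might be worth checking here; a sketch assuming a
>>>>> standard Linux setup, where <ignite-unit> stands for the real unit
>>>>> name, shown truncated in the logs as apache-ign...@config.xml.service:
>>>>>
>>>>>     # 1. Did -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN actually reach the
>>>>>     #    java command line? service.sh builds the command line itself,
>>>>>     #    so the option could be lost on the way:
>>>>>     ps -ef | grep -o 'IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=[a-zA-Z]*'
>>>>>
>>>>>     # 2. What stop timeout does systemd apply to the unit? Raise it
>>>>>     #    with a drop-in if graceful shutdown legitimately needs longer
>>>>>     #    than 10 minutes:
>>>>>     systemctl show -p TimeoutStopUSec '<ignite-unit>'
>>>>>     sudo systemctl edit '<ignite-unit>'
>>>>>     # ...then add:
>>>>>     #   [Service]
>>>>>     #   TimeoutStopSec=30min
>>>>>
>>>>> A longer timeout alone will not help, though, if the shutdown hook is
>>>>> waiting for backups that cannot be rebalanced while the stopping nodes
>>>>> are still in the baseline, as explained at the top of this thread.)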
>>>>> Please help!
>>>>>
>>>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна
>>>>> <roza.ays...@sbermarket.ru> wrote:
>>>>>
>>>>> Hola!
>>>>>
>>>>> I have a problem recovering from a cluster crash when persistence is
>>>>> enabled.
>>>>>
>>>>> Our setup is:
>>>>> - 5 VM nodes with 40 GB RAM and 200 GB disk,
>>>>> - persistence is enabled (on a separate disk on each VM),
>>>>> - all cluster actions are made through Ansible playbooks,
>>>>> - all caches are either partitioned with backups = 1 or replicated,
>>>>> - the cluster starts as a service by running ignite.sh,
>>>>> - baseline auto adjust is disabled.
>>>>>
>>>>> Also, following the docs about partition loss policy, I have added
>>>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
>>>>> partition rebalancing on shutdown.
>>>>>
>>>>> The problem we have: after shutting down several nodes (2 of 5) one
>>>>> after another, an exception about lost partitions is raised:
>>>>>
>>>>> *Caused by:
>>>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>>>> Failed to execute query because cache partition has been lostPart
>>>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>>>
>>>>> But in the logs of the dead nodes I see that all shutdown hooks are
>>>>> called as expected on both nodes:
>>>>>
>>>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
>>>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>>>>>
>>>>> And the baseline topology looks like this (with 2 offline nodes, as
>>>>> expected):
>>>>>
>>>>> Cluster state: active
>>>>> Current topology version: 23
>>>>> Baseline auto adjustment disabled: softTimeout=30000
>>>>>
>>>>> Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)
>>>>>
>>>>> Baseline nodes:
>>>>>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
>>>>>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
>>>>>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
>>>>>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>>>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>>>> --------------------------------------------------------------------------------
>>>>> Number of baseline nodes: 5
>>>>>
>>>>> Other nodes not found.
>>>>>
>>>>> So my questions are:
>>>>>
>>>>> 1) What can I do to recover from the lost partitions problem after
>>>>> shutting down several nodes? I thought that in case of graceful
>>>>> shutdown this problem must be solved.
>>>>>
>>>>> Now I can recover by returning *one* of the offline nodes to the
>>>>> cluster (starting the service) and running the *reset_lost_partitions*
>>>>> command for the broken cache. After this the cache becomes available.
>>>>>
>>>>> 2) What can I do to prevent this problem in a scenario with automatic
>>>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>>>> after activation or redeploy?
>>>>>
>>>>> Please help.
>>>>> Thanks in advance!
>>>>>
>>>>> Best regards,
>>>>> Rose.
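(For completeness: the recovery sequence described in question 1 above could
presumably be scripted along these lines. This is a sketch, not a verified
procedure: <ignite-unit> is a placeholder for the real systemd unit name, and
the cache name is taken from the exception above.

    # On one of the offline nodes, start the Ignite service again:
    sudo systemctl start '<ignite-unit>'

    # Wait until the node shows up as ONLINE in the baseline:
    $IGNITE_HOME/bin/control.sh --baseline

    # Then reset the lost partitions of the affected cache:
    $IGNITE_HOME/bin/control.sh --cache reset_lost_partitions PUBLIC_StoreProductFeatures

As noted in the answers above, this restores cache availability but not the
data copies that were on the offline nodes; a full set of primary and backup
partitions requires all baseline nodes to be back in the cluster.)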