You can read about baseline topology in the documentation [1]. Manual baseline management is available via the control script [2].
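For reference, the manual baseline management mentioned above looks roughly like this with the control script (a sketch; run from the Ignite `bin` directory, and the consistent ID is illustrative):

```shell
# Print current cluster state and baseline topology
./control.sh --baseline

# Remove an offline node from the baseline by its consistent ID
# (this triggers rebalancing of its partitions to the remaining nodes)
./control.sh --baseline remove f151bd52-c173-45d7-952d-45cbe1d5fe97 --yes

# Add a node back to the baseline after it rejoins the cluster
./control.sh --baseline add f151bd52-c173-45d7-952d-45cbe1d5fe97 --yes
```

Removing a node from the baseline is what causes the remaining nodes to restore the configured number of backup copies.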
Links:
1. https://ignite.apache.org/docs/latest/clustering/baseline-topology
2. https://ignite.apache.org/docs/latest/tools/control-script#activation-deactivation-and-topology-management

On Tue, 22 Nov 2022 at 21:58, Ilya Shishkov <shishkovi...@gmail.com> wrote:

> There is a typo here:
>
> Lost partitions are expected behaviour in case of partition because you
> have only 1 backup and lost two nodes.
>
> I meant that lost partitions are expected behaviour for partitioned
> caches when the number of offline nodes is greater than the number of
> backups. In your case there is 1 backup and 2 offline nodes.
>
> On Tue, 22 Nov 2022 at 21:56, Ilya Shishkov <shishkovi...@gmail.com> wrote:
>
>> Hi,
>> > 1) What can I do to recover from the lost partitions problem after
>> shutting down several nodes?
>> > I thought that in the case of a graceful shutdown this problem would
>> be solved.
>> > Now I can recover by returning *one* of the offline nodes to the
>> cluster (starting the service) and running the *reset_lost_partitions*
>> command for the broken cache. After this, the cache becomes available.
>>
>> Are the caches with lost partitions replicated or partitioned? Lost
>> partitions are expected behaviour in case of partition because you have
>> only 1 backup and lost two nodes. If you want cluster data to remain
>> fully available with 2 nodes offline, you should set 2 backups for
>> partitioned caches.
>>
>> As for graceful shutdown: why do you expect that data would not be lost?
>> If you have 1 backup and 1 offline node, then there are some partitions
>> without backups, because the latter remain inaccessible while their owner
>> is offline. So, if you shut down another node holding such partitions,
>> they will be lost.
>>
>> So, for persistent clusters: if you are in a situation where you must
>> work for a long time without backups (i.e. with offline nodes, BUT
>> without partition loss), you should trigger a rebalance. It can be done
>> manually or automatically by changing the baseline.
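The "2 backups" advice above can be expressed in the cache configuration, e.g. in Spring XML (a sketch; the bean is shown in isolation and the cache name is illustrative):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="myPartitionedCache"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <!-- With 2 backups the cache tolerates 2 node failures without data loss -->
    <property name="backups" value="2"/>
    <!-- Fail reads and writes to lost partitions instead of serving stale data -->
    <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
</bean>
```

With `READ_WRITE_SAFE`, operations on lost partitions raise an exception until `reset_lost_partitions` is run, which is the behaviour described later in this thread.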
>> After rebalancing, the number of data copies will be restored.
>>
>> Now you should bring back at least one of the nodes in order to make the
>> partitions available. But if you need a full set of primary and backup
>> partitions, you need all baseline nodes in the cluster.
>>
>> 2) What can I do to prevent this problem in a scenario with automatic
>> cluster deployment? Should I add the *reset_lost_partitions* command
>> after activation or redeploy?
>>
>> I don't fully understand what you mean, but there are no problems with
>> automatic deployments as such. In most cases, a situation with partition
>> losses means that the cluster is in an invalid state.
>>
>> On Tue, 22 Nov 2022 at 19:49, Айсина Роза Мунеровна <
>> roza.ays...@sbermarket.ru> wrote:
>>
>>> Hi Sumit!
>>>
>>> Thanks for your reply!
>>>
>>> Yes, I have used the reset_lost_partitions utility many times.
>>>
>>> The problem is that this command requires all baseline nodes to be
>>> present.
>>> If I shut down a node, auto adjustment does not remove it from the
>>> baseline topology, and reset_lost_partitions fails with an error saying
>>> that all partition owners have left the grid and partition data has been
>>> lost.
>>>
>>> So I remove them manually, and this operation succeeds, but with loss
>>> of the data on the offline nodes.
>>>
>>> What I am trying to understand is why graceful shutdown does not handle
>>> this situation in the case of backup caches and persistence.
>>> How can we automatically bring up Ignite nodes if, after a redeploy,
>>> data is lost because the cluster can't handle the lost partitions
>>> problem?
>>>
>>> Best regards,
>>> Rose.
>>>
>>> On 22 Nov 2022, at 5:44 PM, Sumit Deshinge <sumit.deshi...@gmail.com>
>>> wrote:
>>>
>>> Please check if this helps:
>>> https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss
>>> Also, is there any reason baseline auto adjustment is disabled?
>>>
>>> On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <
>>> roza.ays...@sbermarket.ru> wrote:
>>>
>>>> Hola again!
>>>>
>>>> I discovered that enabling graceful shutdown does not work.
>>>>
>>>> In the service logs I see that nothing happens when *SIGTERM* comes :(
>>>> Eventually the stop action timed out and *SIGKILL* was sent, which
>>>> caused an ungraceful shutdown.
>>>> The timeout is set to *10 minutes*.
>>>>
>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite
>>>> In-Memory Computing Platform Service...
>>>> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite
>>>> In-Memory Computing Platform Service.
>>>> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite
>>>> In-Memory Computing Platform Service...
>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: State 'stop-final-sigterm' timed
>>>> out. Killing.
>>>> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: Killing process 11135 (java) with
>>>> signal SIGKILL.
>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]:
>>>> apache-ign...@config.xml.service: Failed with result 'timeout'.
>>>> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite
>>>> In-Memory Computing Platform Service.
>>>>
>>>>
>>>> I also enabled *DEBUG* level and see that nothing happens after
>>>> rebalancing has started (this is the end of the log):
>>>>
>>>> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown
>>>> hook...
>>>> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in
>>>> progress (ignoring): Shutdown in progress
>>>> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches
>>>> have sufficient backups and local rebalance completion...
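One contributing factor in the logs above is that systemd hard-kills the JVM when its stop timeout expires, while waiting for backups on shutdown can legitimately take a long time. The stop timeout can be raised in a unit override (a sketch; the override path and unit name are illustrative, based on the truncated unit name in the logs):

```ini
# e.g. /etc/systemd/system/apache-ignite@.service.d/override.conf
[Service]
# Give the node more time to finish rebalancing before systemd sends SIGKILL
TimeoutStopSec=30min
```

After creating the override, `systemctl daemon-reload` is needed for it to take effect. This does not explain why the shutdown hook appears to hang, but it rules out premature SIGKILL as a cause of ungraceful stops.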
>>>>
>>>>
>>>> I forgot to add that the service is started with *service.sh*, not
>>>> *ignite.sh*.
>>>>
>>>> Please help!
>>>>
>>>> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <
>>>> roza.ays...@sbermarket.ru> wrote:
>>>>
>>>> Hola!
>>>> I have a problem recovering from a cluster crash when persistence is
>>>> enabled.
>>>>
>>>> Our setup is:
>>>> - 5 VM nodes with 40 GB RAM and a 200 GB disk,
>>>> - persistence is enabled (on a separate disk on each VM),
>>>> - all cluster actions are performed through Ansible playbooks,
>>>> - all caches are either partitioned with backups = 1 or replicated,
>>>> - the cluster starts as a service running ignite.sh,
>>>> - baseline auto adjust is disabled.
>>>>
>>>> Also, following the docs about partition loss policy, I have added
>>>> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait
>>>> for partition rebalancing on shutdown.
>>>>
>>>> The problem we have: after shutting down several nodes (2 of 5) one
>>>> after another, an exception about lost partitions is raised:
>>>>
>>>> *Caused by:
>>>> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
>>>> Failed to execute query because cache partition has been lostPart
>>>> [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>>>>
>>>> But in the logs of the dead nodes I see that all shutdown hooks are
>>>> called as expected on both nodes:
>>>>
>>>> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown
>>>> hook...
>>>> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches
>>>> have sufficient backups and local rebalance completion...
>>>>
>>>>
>>>> And the baseline topology looks like this (with 2 offline nodes, as
>>>> expected):
>>>>
>>>> Cluster state: active
>>>> Current topology version: 23
>>>> Baseline auto adjustment disabled: softTimeout=30000
>>>>
>>>> Current topology version: 23 (Coordinator:
>>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1,
>>>> Order=3)
>>>>
>>>> Baseline nodes:
>>>> ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b,
>>>> Address=172.17.0.1, State=ONLINE, Order=3
>>>> ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b,
>>>> Address=172.17.0.1, State=ONLINE, Order=21
>>>> ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e,
>>>> Address=172.17.0.1, State=ONLINE, Order=5
>>>> ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>>>> ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
>>>>
>>>> --------------------------------------------------------------------------------
>>>> Number of baseline nodes: 5
>>>>
>>>> Other nodes not found.
>>>>
>>>>
>>>> So my questions are:
>>>>
>>>> 1) What can I do to recover from the lost partitions problem after
>>>> shutting down several nodes? I thought that in the case of a graceful
>>>> shutdown this problem would be solved.
>>>>
>>>> Now I can recover by returning *one* of the offline nodes to the
>>>> cluster (starting the service) and running the *reset_lost_partitions*
>>>> command for the broken cache. After this, the cache becomes available.
>>>>
>>>> 2) What can I do to prevent this problem in a scenario with automatic
>>>> cluster deployment? Should I add the *reset_lost_partitions* command
>>>> after activation or redeploy?
>>>>
>>>> Please help.
>>>> Thanks in advance!
>>>>
>>>> Best regards,
>>>> Rose.
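The recovery path described in question 1 corresponds roughly to this control script invocation, run after at least one former partition owner is back online (a sketch; the cache name is the one from the exception above):

```shell
# Declare the lost partitions of the listed cache(s) usable again,
# accepting that data from still-offline owners may be missing
./control.sh --cache reset_lost_partitions PUBLIC_StoreProductFeatures
```

The command takes a comma-separated list of cache names if several caches have lost partitions.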
>>>> --
>>>> *Роза Айсина*
>>>> Senior Software Developer
>>>> *СберМаркет* | Delivery from your favorite stores
>>>> Email: roza.ays...@sbermarket.ru
>>>> Web: sbermarket.ru
>>>>
>>>> *CONFIDENTIALITY NOTICE:* This email and any files attached to it are
>>>> confidential. If you are not the intended recipient you are notified
>>>> that using, copying, distributing or taking any action in reliance on
>>>> the contents of this information is strictly prohibited. If you have
>>>> received this email in error please notify the sender and delete this
>>>> email.
>>>
>>> --
>>> Regards,
>>> Sumit Deshinge
>>>
>>