Please check if this helps: https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss

Also, is there a reason baseline auto-adjustment is disabled?
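The policy from that doc page is set per cache. A minimal sketch of the relevant settings in a Spring XML config of the kind this cluster already uses (the cache name is the one from this thread; the rest of the bean is elided):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="cacheConfiguration">
        <bean class="org.apache.ignite.configuration.CacheConfiguration">
            <property name="name" value="PUBLIC_StoreProductFeatures"/>
            <property name="backups" value="1"/>
            <!-- Fail reads and writes against lost partitions instead of
                 serving possibly stale data; the lost state must then be
                 cleared explicitly with reset_lost_partitions. -->
            <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
        </bean>
    </property>
</bean>
```

Baseline auto-adjust can also be enabled at runtime with `control.sh --baseline auto_adjust enable timeout 30000` (soft timeout in milliseconds).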
On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <roza.ays...@sbermarket.ru> wrote:

> Hola again!
>
> I discovered that enabling graceful shutdown does not work.
>
> In the service logs I see that nothing happens when *SIGTERM* arrives :(
> Eventually the stop action times out and *SIGKILL* is sent, which causes an
> ungraceful shutdown. The timeout is set to *10 minutes*.
>
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.
>
> I also enabled *DEBUG* level and see that nothing happens after rebalancing
> started (this is the end of the log):
>
> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>
> I forgot to add that the service is started with *service.sh*, not *ignite.sh*.
>
> Please help!
>
> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <roza.ays...@sbermarket.ru> wrote:
>
> Hola!
> I have a problem recovering from a cluster crash when persistence is enabled.
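As an aside: besides the JVM system property, graceful shutdown can also be requested in the node configuration itself. A sketch, assuming Ignite 2.9 or later (where the `shutdownPolicy` property of `IgniteConfiguration` was introduced):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Equivalent in effect to -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true:
         on stop, the node waits until the remaining nodes hold sufficient
         backup copies before leaving the cluster. -->
    <property name="shutdownPolicy" value="GRACEFUL"/>
</bean>
```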
> Our setup is:
> - 5 VM nodes with 40 GB RAM and 200 GB disk,
> - persistence is enabled (on a separate disk on each VM),
> - all cluster actions are performed through Ansible playbooks,
> - all caches are either partitioned with backups = 1 or replicated,
> - the cluster starts as a service running ignite.sh,
> - baseline auto-adjust is disabled.
>
> Also, following the docs about partition loss policy, I have added
> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
> partition rebalancing.
>
> The problem we have: after shutting down several nodes (2 of 5) one after
> another, an exception about lost partitions is raised:
>
> *Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>
> But in the logs of the dead nodes I see that all shutdown hooks were called
> as expected on both nodes:
>
> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
> And the baseline topology looks like this (with 2 offline nodes, as expected):
>
> Cluster state: active
> Current topology version: 23
> Baseline auto adjustment disabled: softTimeout=30000
>
> Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)
>
> Baseline nodes:
>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
> --------------------------------------------------------------------------------
> Number of baseline nodes: 5
>
> Other nodes not found.
>
> So my questions are:
>
> 1) What can I do to recover from the lost-partitions problem after shutting
> down several nodes? I thought a graceful shutdown was supposed to prevent it.
>
> Currently I can recover by returning *one* of the offline nodes to the
> cluster (starting the service) and running the *reset_lost_partitions*
> command for the broken cache. After that the cache becomes available again.
>
> 2) What can I do to prevent this problem in a scenario with automatic
> cluster deployment? Should I add the *reset_lost_partitions* command after
> activation or redeploy?
>
> Please help.
> Thanks in advance!
>
> Best regards,
> Rose.
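For the recovery step described in question 1, the reset can be scripted into the deployment flow once enough baseline nodes are back online. A sketch of the control.sh invocations, using the host and cache name from this thread (a cluster must be reachable for these to succeed):

```
# Check partition state for all caches (list accepts a regex pattern).
./control.sh --host 172.17.0.1 --cache list .

# Clear the lost-partition state for the affected cache so that reads
# and writes are allowed again.
./control.sh --host 172.17.0.1 --cache reset_lost_partitions PUBLIC_StoreProductFeatures
```

Note that with backups=1, shutting down 2 of 5 nodes can leave some partitions with neither their primary nor their backup copy online, which is why the partitions are reported as lost despite the graceful shutdown hooks running.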
> --
> *Роза Айсина*
> Senior Software Developer
> *СберМаркет* | Delivery from your favorite stores
>
> Email: roza.ays...@sbermarket.ru
> Mob:
> Web: sbermarket.ru
> App: iOS
> <https://apps.apple.com/ru/app/%D1%81%D0%B1%D0%B5%D1%80%D0%BC%D0%B0%D1%80%D0%BA%D0%B5%D1%82-%D0%B4%D0%BE%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0-%D0%BF%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D0%BE%D0%B2/id1166642457>
> and Android
> <https://play.google.com/store/apps/details?id=ru.instamart&hl=en&gl=ru>
>
> *CONFIDENTIALITY NOTICE:* This email and any files attached to it are
> confidential. If you are not the intended recipient you are notified that
> using, copying, distributing or taking any action in reliance on the
> contents of this information is strictly prohibited. If you have received
> this email in error please notify the sender and delete this email.
--
Regards,
Sumit Deshinge