Please check if this helps: https://ignite.apache.org/docs/latest/configuring-caches/partition-loss-policy#handling-partition-loss

Also, is there a reason baseline auto-adjustment is disabled?
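The policy from that doc page is set per cache. A minimal sketch of the relevant settings in a Spring XML config of the kind this cluster already uses (the cache name is the one from this thread; the rest of the bean is elided):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="cacheConfiguration">
        <bean class="org.apache.ignite.configuration.CacheConfiguration">
            <property name="name" value="PUBLIC_StoreProductFeatures"/>
            <property name="backups" value="1"/>
            <!-- Fail reads and writes against lost partitions instead of
                 serving possibly stale data; the lost state must then be
                 cleared explicitly with reset_lost_partitions. -->
            <property name="partitionLossPolicy" value="READ_WRITE_SAFE"/>
        </bean>
    </property>
</bean>
```

Baseline auto-adjust can also be enabled at runtime with `control.sh --baseline auto_adjust enable timeout 30000` (soft timeout in milliseconds).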
On Tue, Nov 22, 2022 at 6:38 PM Айсина Роза Мунеровна <roza.ays...@sbermarket.ru> wrote:

> Hola again!
>
> I discovered that enabling graceful shutdown does not work.
>
> In the service logs I see that nothing happens when *SIGTERM* arrives :(
> Eventually the stop action times out and *SIGKILL* is sent, which causes an
> ungraceful shutdown. The timeout is set to *10 minutes*.
>
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Starting Apache Ignite In-Memory Computing Platform Service...
> Nov 22 12:27:23 yc-ignite-lab-02 systemd[1]: Started Apache Ignite In-Memory Computing Platform Service.
> Nov 22 12:29:25 yc-ignite-lab-02 systemd[1]: Stopping Apache Ignite In-Memory Computing Platform Service...
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: State 'stop-final-sigterm' timed out. Killing.
> Nov 22 12:39:25 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Killing process 11135 (java) with signal SIGKILL.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: apache-ign...@config.xml.service: Failed with result 'timeout'.
> Nov 22 12:39:27 yc-ignite-lab-02 systemd[1]: Stopped Apache Ignite In-Memory Computing Platform Service.
>
> I also enabled *DEBUG* level and see that nothing happens after rebalancing
> started (this is the end of the log):
>
> [2022-11-22T12:29:25,957][INFO ][shutdown-hook][G] Invoking shutdown hook...
> [2022-11-22T12:29:25,958][DEBUG][shutdown-hook][G] Shutdown is in progress (ignoring): Shutdown in progress
> [2022-11-22T12:29:25,959][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
>
> I forgot to add that the service is started with *service.sh*, not *ignite.sh*.
>
> Please help!
>
> On 22 Nov 2022, at 1:17 PM, Айсина Роза Мунеровна <roza.ays...@sbermarket.ru> wrote:
>
> Hola!
> I have a problem recovering from a cluster crash when persistence is enabled.
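As an aside: besides the JVM system property, graceful shutdown can also be requested in the node configuration itself. A sketch, assuming Ignite 2.9 or later (where the `shutdownPolicy` property of `IgniteConfiguration` was introduced):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Equivalent in effect to -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true:
         on stop, the node waits until the remaining nodes hold sufficient
         backup copies before leaving the cluster. -->
    <property name="shutdownPolicy" value="GRACEFUL"/>
</bean>
```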
> Our setup is:
> - 5 VM nodes with 40 GB RAM and 200 GB disk,
> - persistence is enabled (on a separate disk on each VM),
> - all cluster actions are performed through Ansible playbooks,
> - all caches are either partitioned with backups = 1 or replicated,
> - the cluster starts as a service running ignite.sh,
> - baseline auto-adjust is disabled.
>
> Also, following the docs about partition loss policy, I have added
> *-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true* to *JVM_OPTS* to wait for
> partition rebalancing.
>
> The problem we have: after shutting down several nodes (2 of 5) one after
> another, an exception about lost partitions is raised:
>
> *Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute query because cache partition has been lostPart [cacheName=PUBLIC_StoreProductFeatures, part=512]*
>
> But in the logs of the dead nodes I see that all shutdown hooks were called
> as expected on both nodes:
>
> [2022-11-22T09:24:19,614][INFO ][shutdown-hook][G] Invoking shutdown hook...
> [2022-11-22T09:24:19,615][INFO ][shutdown-hook][G] Ensuring that caches have sufficient backups and local rebalance completion...
> And the baseline topology looks like this (with 2 offline nodes, as expected):
>
> Cluster state: active
> Current topology version: 23
> Baseline auto adjustment disabled: softTimeout=30000
>
> Current topology version: 23 (Coordinator: ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, Order=3)
>
> Baseline nodes:
>     ConsistentId=1c6bad01-d187-40fa-ae9b-74023d080a8b, Address=172.17.0.1, State=ONLINE, Order=3
>     ConsistentId=4f67fccb-211b-4514-916b-a6286d1bb71b, Address=172.17.0.1, State=ONLINE, Order=21
>     ConsistentId=d980fa1c-e955-428a-bac9-d67dbfebb75e, Address=172.17.0.1, State=ONLINE, Order=5
>     ConsistentId=f151bd52-c173-45d7-952d-45cbe1d5fe97, State=OFFLINE
>     ConsistentId=f6862354-b175-4a0c-a94c-20253a944996, State=OFFLINE
> --------------------------------------------------------------------------------
> Number of baseline nodes: 5
>
> Other nodes not found.
>
> So my questions are:
>
> 1) What can I do to recover from the lost-partitions problem after shutting
> down several nodes? I thought a graceful shutdown was supposed to prevent it.
>
> Currently I can recover by returning *one* of the offline nodes to the
> cluster (starting the service) and running the *reset_lost_partitions*
> command for the broken cache. After that the cache becomes available again.
>
> 2) What can I do to prevent this problem in a scenario with automatic
> cluster deployment? Should I add the *reset_lost_partitions* command after
> activation or redeploy?
>
> Please help.
> Thanks in advance!
>
> Best regards,
> Rose.
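For the recovery step described in question 1, the reset can be scripted into the deployment flow once enough baseline nodes are back online. A sketch of the control.sh invocations, using the host and cache name from this thread (a cluster must be reachable for these to succeed):

```
# Check partition state for all caches (list accepts a regex pattern).
./control.sh --host 172.17.0.1 --cache list .

# Clear the lost-partition state for the affected cache so that reads
# and writes are allowed again.
./control.sh --host 172.17.0.1 --cache reset_lost_partitions PUBLIC_StoreProductFeatures
```

Note that with backups=1, shutting down 2 of 5 nodes can leave some partitions with neither their primary nor their backup copy online, which is why the partitions are reported as lost despite the graceful shutdown hooks running.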
> --
> *Роза Айсина*
> Senior Software Developer
> *СберМаркет* | Delivery from your favorite stores
>
> Email: roza.ays...@sbermarket.ru
> Mob:
> Web: sbermarket.ru
> App: iOS
> <https://apps.apple.com/ru/app/%D1%81%D0%B1%D0%B5%D1%80%D0%BC%D0%B0%D1%80%D0%BA%D0%B5%D1%82-%D0%B4%D0%BE%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0-%D0%BF%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D0%BE%D0%B2/id1166642457>
> and Android
> <https://play.google.com/store/apps/details?id=ru.instamart&hl=en&gl=ru>
>
> *CONFIDENTIALITY NOTICE:* This email and any files attached to it are
> confidential. If you are not the intended recipient you are notified that
> using, copying, distributing or taking any action in reliance on the
> contents of this information is strictly prohibited. If you have received
> this email in error please notify the sender and delete this email.
--
Regards,
Sumit Deshinge