Re: IEP-14: Ignite failures handling (Discussion)
Yakov,

I agree with Andrey that a separate abstraction for failure handling makes sense. First, using event listeners for this kind of response allows users to install multiple listeners, which may be invoked in an unpredictable order; this looks error-prone to me. Second, we may add additional methods to failure handlers in future releases (say, Ignite 3.00), so it is better to have a separate interface right away.

I do not mind adding a separate event for this, but the event should be used for notifications, not to run any reaction code.

--AG

2018-03-23 22:27 GMT+03:00 Yakov Zhdanov:

> Andrey, I understand your point but you are trying to build one more mechanism and introduce abstractions that are already here. Again, please take a look at segmentation policy and event types we already have.
>
> Thanks!
>
> Yakov
Re: IEP-14: Ignite failures handling (Discussion)
Andrey,

I understand your point but you are trying to build one more mechanism and introduce abstractions that are already here. Again, please take a look at segmentation policy and event types we already have.

Thanks!

Yakov
Re: IEP-14: Ignite failures handling (Discussion)
Yakov,

DiscoveryWorker is itself a critical worker and could be terminated or blocked by a user-provided listener. So a dedicated abstraction for failure handling is a more robust way to solve the problem, because it doesn't depend on other components.

On Tue, Mar 20, 2018 at 1:33 PM, Yakov Zhdanov wrote:

> If Java runs into an OOME, then you cannot guarantee anything, including calling Runtime.halt().
>
> My point is about a consistent approach throughout the project. I think developing a new mechanism with a separate interface is incorrect.
>
> Yakov
Re: IEP-14: Ignite failures handling (Discussion)
If Java runs into an OOME, then you cannot guarantee anything, including calling Runtime.halt().

My point is about a consistent approach throughout the project. I think developing a new mechanism with a separate interface is incorrect.

Yakov
Re: IEP-14: Ignite failures handling (Discussion)
On Mon, Mar 19, 2018 at 2:24 PM, Yakov Zhdanov wrote:

> Andrey Gura,
>
> Why should we have any FailureHandler abstraction? We already have it: this is EventListener. In my view it is better (and a cleaner design) to add events (similar to, for example, org.apache.ignite.events.EventType#EVT_NODE_SEGMENTED) like EVT_IGNITE_OOME and EVT_SYS_WORKER_FAILED, fire them according to the situation, and execute the configured system logic. We do exactly the same with segmentation: we have a policy that defines how the system reacts, and we also allow the user to add event listeners.

Yakov, how would it be possible to fire the events if Ignite is not in an operational state? For example, what can a user do if the Java application has run out of memory?
Re: IEP-14: Ignite failures handling (Discussion)
Andrey Gura,

Why should we have any FailureHandler abstraction? We already have it: this is EventListener. In my view it is better (and a cleaner design) to add events (similar to, for example, org.apache.ignite.events.EventType#EVT_NODE_SEGMENTED) like EVT_IGNITE_OOME and EVT_SYS_WORKER_FAILED, fire them according to the situation, and execute the configured system logic. We do exactly the same with segmentation: we have a policy that defines how the system reacts, and we also allow the user to add event listeners.

For better understanding, please take a look at org.apache.ignite.plugin.segmentation.SegmentationPolicy and org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.DiscoveryWorker#onSegmentation. The discovery manager records the event (allowing the user to get a notification on it) and executes internal logic if the segmentation policy is not NOOP.

Thanks!

--Yakov
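For illustration, the event-plus-policy flow described in this message (record the event for listeners, then run the configured reaction unless the policy is NOOP) could be sketched roughly as follows. This is only a sketch under assumptions: `FailurePolicySketch`, `FailureEventDispatcherSketch`, and the `reaction` callback are hypothetical names rather than Ignite classes, and event types are passed as plain strings.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical policy values, loosely modelled on a segmentation-style policy. */
enum FailurePolicySketch { NOOP, STOP, RESTART_JVM }

/** Hypothetical dispatcher: notifies listeners first, then applies the configured policy. */
final class FailureEventDispatcherSketch {
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();
    private final FailurePolicySketch policy;
    private final Consumer<FailurePolicySketch> reaction; // stands in for the "configured system logic"

    FailureEventDispatcherSketch(FailurePolicySketch policy, Consumer<FailurePolicySketch> reaction) {
        this.policy = policy;
        this.reaction = reaction;
    }

    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** Records the event for user listeners, then reacts unless the policy is NOOP. */
    void fire(String eventType) {
        for (Consumer<String> listener : listeners) {
            try {
                listener.accept(eventType);
            } catch (RuntimeException ignored) {
                // A listener that throws must not prevent the system reaction.
            }
        }
        if (policy != FailurePolicySketch.NOOP)
            reaction.accept(policy);
    }
}
```

Note that in this sketch a listener that blocks (rather than throws) would still delay the reaction, which is the concern raised elsewhere in the thread about user-provided listeners.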
Re: IEP-14: Ignite failures handling (Discussion)
Thanks Andrey! I have added a few comments to the IEP-14 page.

D.

On Fri, Mar 16, 2018 at 6:44 AM, Andrey Gura wrote:

> Hi!
>
> Thank you all for your opinions and ideas!
>
> While reading the thread I came to two important conclusions:
>
> 1. The proposed API should be changed, because an enumeration of possible actions is a bad idea. A cleaner and simpler design should allow the user to provide a failure handler implementation with custom failure handling logic if needed.
>
> 2. Several failure handler implementations should be provided out of the box, to give a simple way of changing the default behaviour through configuration. The following implementations should be provided:
>
>    - NoOpFailureHandler - Useful for tests and debugging.
>    - RestartProcessFailureHandler - A specific implementation that can be used only with ignite.(sh|bat).
>    - StopNodeFailureHandler - Stops the Ignite node in case of a critical error.
>    - StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) - The default failure handler. It tries to stop the node if tryStop is true. If the node can't be stopped, or tryStop is false, the JVM process is terminated forcibly (Runtime.halt()). The default value of tryStop is false. Of course, we should limit the node shutdown time in order to prevent hangs.
>
> As for the default behavior, I agree with those who believe that the most suitable default option is process termination (although I had a different opinion before). The strongest argument for this choice is the impossibility of reasoning about the system state in case of a critical error.
>
> I also believe that we can't choose a solution that will suit every community member, and the best we can do is provide a simple way of changing this behavior.
>
> So I think the default behavior discussion should be finished. I'll update IEP-14 [1] according to my conclusions above. If you have any ideas or thoughts about these conclusions, please feel free to share.
>
> Thanks!
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
Re: IEP-14: Ignite failures handling (Discussion)
Hi!

Thank you all for your opinions and ideas!

While reading the thread I came to two important conclusions:

1. The proposed API should be changed, because an enumeration of possible actions is a bad idea. A cleaner and simpler design should allow the user to provide a failure handler implementation with custom failure handling logic if needed.

2. Several failure handler implementations should be provided out of the box, to give a simple way of changing the default behaviour through configuration. The following implementations should be provided:

   - NoOpFailureHandler - Useful for tests and debugging.
   - RestartProcessFailureHandler - A specific implementation that can be used only with ignite.(sh|bat).
   - StopNodeFailureHandler - Stops the Ignite node in case of a critical error.
   - StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) - The default failure handler. It tries to stop the node if tryStop is true. If the node can't be stopped, or tryStop is false, the JVM process is terminated forcibly (Runtime.halt()). The default value of tryStop is false. Of course, we should limit the node shutdown time in order to prevent hangs.

As for the default behavior, I agree with those who believe that the most suitable default option is process termination (although I had a different opinion before). The strongest argument for this choice is the impossibility of reasoning about the system state in case of a critical error.

I also believe that we can't choose a solution that will suit every community member, and the best we can do is provide a simple way of changing this behavior.

So I think the default behavior discussion should be finished. I'll update IEP-14 [1] according to my conclusions above. If you have any ideas or thoughts about these conclusions, please feel free to share.

Thanks!

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

On Fri, Mar 16, 2018 at 1:07 AM, Dmitriy Setrakyan wrote:

> On Thu, Mar 15, 2018 at 5:21 AM, Dmitry Pavlov wrote:
>
>> Hi Dmitriy,
>>
>> It seems that everyone here agrees that killing the process will give a more guaranteed result. The question is that the majority in the community does not consider this acceptable when Ignite is started as an embedded lib (e.g. from Java, using Ignition.start()).
>>
>> What can help to accept the community's opinion? Let's remember the Apache principle: "community first".
>
> I am still confused about the problem the majority of the community is trying to solve. If our priority is to keep the cluster in a frozen state, then what is the reason for this task altogether?
>
> The priority should be to keep the cluster operational, not frozen. The only solution here is "kill" or "stop+kill". If the community does not accept this option as a default, then I propose to drop this task altogether, because we do not have to do anything to keep the cluster frozen.
>
>> If release 2.5 shows us it was impractical, we will change the default to kill even for the library. What do you think?
>
> See above. I do not see a reason to continue with this task if the end result is identical to what we have today.
>
> I want to give the community another chance to speak up and voice their opinions again, having fully understood the context and the problem being solved here.
>
> D.
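As a rough illustration of the handler-based design proposed in this message, the sketch below shows a single-method handler contract and a "try to stop within a timeout, otherwise halt" implementation. It is not the actual Ignite API: the interface name, the `onFailure` signature, and the injected `stopNode` and `halt` hooks are all assumptions made so that the example is self-contained and can be exercised without terminating a JVM.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical handler contract; not the actual Ignite interface. */
interface FailureHandlerSketch {
    /** Reacts to a critical failure; returns true if the node was stopped or halted. */
    boolean onFailure(Throwable cause);
}

/** Ignores failures; the kind of handler that is useful for tests and debugging. */
final class NoOpHandlerSketch implements FailureHandlerSketch {
    @Override public boolean onFailure(Throwable cause) {
        return false;
    }
}

/** Tries a time-bounded graceful stop (if enabled), then falls back to a forced halt. */
final class StopOrHaltHandlerSketch implements FailureHandlerSketch {
    private final boolean tryStop;
    private final long timeoutMs;
    private final Runnable stopNode; // hypothetical graceful-stop hook
    private final Runnable halt;     // hypothetical forced-termination hook

    StopOrHaltHandlerSketch(boolean tryStop, long timeoutMs, Runnable stopNode, Runnable halt) {
        this.tryStop = tryStop;
        this.timeoutMs = timeoutMs;
        this.stopNode = stopNode;
        this.halt = halt;
    }

    @Override public boolean onFailure(Throwable cause) {
        if (tryStop && stoppedWithinTimeout())
            return true;
        halt.run(); // reached when tryStop is false or the stop did not finish in time
        return true;
    }

    /** Runs the stop hook on a daemon thread so that a hung stop cannot block the caller forever. */
    private boolean stoppedWithinTimeout() {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<?> stop = exec.submit(stopNode);
            stop.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (Exception e) {
            // Timeout, interruption, or a failed stop: the caller falls back to halt.
            return false;
        } finally {
            exec.shutdownNow();
        }
    }
}
```

In a real process the `halt` argument would presumably be something like `() -> Runtime.getRuntime().halt(1)`; keeping it injectable is a choice made for this sketch only.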
Re: IEP-14: Ignite failures handling (Discussion)
On Thu, Mar 15, 2018 at 5:21 AM, Dmitry Pavlov wrote:

> Hi Dmitriy,
>
> It seems that everyone here agrees that killing the process will give a more guaranteed result. The question is that the majority in the community does not consider this acceptable when Ignite is started as an embedded lib (e.g. from Java, using Ignition.start()).
>
> What can help to accept the community's opinion? Let's remember the Apache principle: "community first".

I am still confused about the problem the majority of the community is trying to solve. If our priority is to keep the cluster in a frozen state, then what is the reason for this task altogether?

The priority should be to keep the cluster operational, not frozen. The only solution here is "kill" or "stop+kill". If the community does not accept this option as a default, then I propose to drop this task altogether, because we do not have to do anything to keep the cluster frozen.

> If release 2.5 shows us it was impractical, we will change the default to kill even for the library. What do you think?

See above. I do not see a reason to continue with this task if the end result is identical to what we have today.

I want to give the community another chance to speak up and voice their opinions again, having fully understood the context and the problem being solved here.

D.
Re: IEP-14: Ignite failures handling (Discussion)
Hi Dmitriy,

It seems that everyone here agrees that killing the process will give a more guaranteed result. The question is that the majority in the community does not consider this acceptable when Ignite is started as an embedded lib (e.g. from Java, using Ignition.start()).

What can help to accept the community's opinion? Let's remember the Apache principle: "community first".

If release 2.5 shows us it was impractical, we will change the default to kill even for the library. What do you think?

Sincerely,
Dmitriy Pavlov

On Thu, Mar 15, 2018 at 5:48, Dmitriy Setrakyan wrote:

> On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev wrote:
>
>> I'm not disagreeing with you, Dmitriy.
>>
>> What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very, very long STW pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process itself may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the env issue and can either make a decision by itself or send a notification to the SRE team.
>
> Agree about the external observers, but that is something a user should do outside of Ignite.
>
>> In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).
>
> The approach I am suggesting is pretty simple: "kill" the process in case of a critical error. The only intelligence I would like to add is to attempt shutting down the Ignite node gracefully before the "kill" is executed. If a node is shut down gracefully, then the restart procedure will be faster, so it is worthwhile to try.
>
> Some of the critical errors include running out of disk or memory, losing Ignite system threads, etc. These errors are truly unrecoverable from the application standpoint and should mostly be handled with a process restart anyway.
>
> D.
Re: IEP-14: Ignite failures handling (Discussion)
On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev wrote:

> I'm not disagreeing with you, Dmitriy.
>
> What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very, very long STW pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process itself may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the env issue and can either make a decision by itself or send a notification to the SRE team.

Agree about the external observers, but that is something a user should do outside of Ignite.

> In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).

The approach I am suggesting is pretty simple: "kill" the process in case of a critical error. The only intelligence I would like to add is to attempt shutting down the Ignite node gracefully before the "kill" is executed. If a node is shut down gracefully, then the restart procedure will be faster, so it is worthwhile to try.

Some of the critical errors include running out of disk or memory, losing Ignite system threads, etc. These errors are truly unrecoverable from the application standpoint and should mostly be handled with a process restart anyway.

D.
Re: IEP-14: Ignite failures handling (Discussion)
I'm not disagreeing with you, Dmitriy.

What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very, very long STW pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process itself may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the env issue and can either make a decision by itself or send a notification to the SRE team.

In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).

Regards
Andrey

From: Dmitriy Setrakyan <dsetrak...@apache.org>
Sent: Wednesday, March 14, 2018 6:22 PM
To: dev@ignite.apache.org
Subject: Re: IEP-14: Ignite failures handling (Discussion)

On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <andrewkor...@hotmail.com> wrote:

> If I were the one responsible for running Ignite-based applications (be it embedded or standalone Ignite) in my company's datacenter, I'd prefer the application nodes simply make their current state readily available to external tools (via JMX, health checks, etc.) and leave the decision of when to die and when to continue to run up to me. The last thing I need in production is a too-clever application that decides to kill itself based on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor the health of the applications, and they like to be as much in control as possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over-engineer. In real production environments the companies will most likely have this feature disabled (I know I would) and instead rely on their own tooling for handling failures.

Andrey, our priority should be to keep the cluster operational. If a frozen Ignite node is kept around, the whole cluster becomes un-operational. I bet this is not what you would prefer in production either. However, if we kill the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node should not matter. If we want to keep this promise to the users, then we must kill the process if an Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you are not happy with the "default" mode, then you will be able to configure other behaviors, like keeping the frozen Ignite node around, if you like.

D.
Re: IEP-14: Ignite failures handling (Discussion)
On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev wrote:

> If I were the one responsible for running Ignite-based applications (be it embedded or standalone Ignite) in my company's datacenter, I'd prefer the application nodes simply make their current state readily available to external tools (via JMX, health checks, etc.) and leave the decision of when to die and when to continue to run up to me. The last thing I need in production is a too-clever application that decides to kill itself based on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor the health of the applications, and they like to be as much in control as possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over-engineer. In real production environments the companies will most likely have this feature disabled (I know I would) and instead rely on their own tooling for handling failures.

Andrey, our priority should be to keep the cluster operational. If a frozen Ignite node is kept around, the whole cluster becomes un-operational. I bet this is not what you would prefer in production either. However, if we kill the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node should not matter. If we want to keep this promise to the users, then we must kill the process if an Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you are not happy with the "default" mode, then you will be able to configure other behaviors, like keeping the frozen Ignite node around, if you like.

D.
Re: IEP-14: Ignite failures handling (Discussion)
On Tue, Mar 13, 2018 at 11:17 PM, Nick Pordash wrote:

> I can tell you as a user that if any library I was using in my application called System.exit without my consent would result in a lot of frustration.
>
> If ignite enters an unrecoverable state then I think that is something that should be observable locally, similar to node segmentation and then the application can decide the best course of action.

Nick, you would be a lot more frustrated if Ignite was frozen and every call to Ignite would freeze the application threads as well.

Again, if you prefer to keep the process around, even if Ignite freezes, then you can always configure this behavior, but I still believe that the default should be to kill the process.

Ignite is a horizontally scalable system, so killing of one node should not be a significant event and should not disrupt the cluster. However, a freeze of one node is a significant event and can bring the whole cluster to a halt.

D.
Re: IEP-14: Ignite failures handling (Discussion)
As far as shutdown goes, what we need to implement is a "hard shutdown" mode. This is when we first close all network sockets, then cancel all registered futures. This would be enough to unblock the cluster and local user threads.

On Wed, Mar 14, 2018 at 8:40, Vladimir Ozerov wrote:

> Valya,
>
> This is very easy to answer - if CommandLineStartup is used, then it is a standalone node. In all other cases it is embedded.
>
> If node shutdown hangs - just let it continue hanging, so that application admins are able to decide on their own what to do next. Someone would want to get the stack trace, others would decide to restart outside of business hours (e.g. because Ignite is used only in part of their application), someone else would try to gracefully shut down other components before stopping the process to minimize negative impact, etc.
>
> I don't quite understand why we are guessing here how embedded Ignite is used. It could be used in any way and in any combination with other frameworks. Process stop by default is simply not an option.
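The "hard shutdown" order described at the top of this message (close network sockets first, then cancel registered futures) might look roughly like the sketch below. The class and method names are hypothetical, and how Ignite actually tracks its sockets and futures is not shown in the thread, so both are simply passed in as lists here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Future;

/** Hypothetical helper illustrating a two-step "hard shutdown". */
final class HardShutdownSketch {
    private HardShutdownSketch() {
    }

    /**
     * Closes every socket-like resource, then cancels every pending future.
     *
     * @return the number of futures that were actually cancelled.
     */
    static int hardShutdown(List<? extends Closeable> sockets, List<? extends Future<?>> futures) {
        // Step 1: close network resources so that remote nodes stop waiting on this one.
        for (Closeable socket : sockets) {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Best effort: keep closing the remaining resources.
            }
        }

        // Step 2: cancel pending futures so that local threads blocked on them are released.
        int cancelled = 0;
        for (Future<?> future : futures) {
            if (future.cancel(true))
                cancelled++;
        }
        return cancelled;
    }
}
```

Threads waiting in `Future.get()` on a cancelled future receive a `CancellationException`, which is what would unblock local user threads in this sketch.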
Re: IEP-14: Ignite failures handling (Discussion)
Valya,

This is very easy to answer - if CommandLineStartup is used, then it is a standalone node. In all other cases it is embedded.

If node shutdown hangs - just let it continue hanging, so that application admins are able to decide on their own what to do next. Someone would want to get the stack trace, others would decide to restart outside of business hours (e.g. because Ignite is used only in part of their application), someone else would try to gracefully shut down other components before stopping the process to minimize negative impact, etc.

I don't quite understand why we are guessing here how embedded Ignite is used. It could be used in any way and in any combination with other frameworks. Process stop by default is simply not an option.

On Wed, Mar 14, 2018 at 3:12, Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

> Ivan,
>
> If the grid hangs, graceful shutdown would most likely hang as well. You can almost never recover from a bad state using graceful procedures.
>
> I agree that we should not create two defaults, especially in this case. It's not even strictly defined what an embedded node is in Ignite. For example, if I start it using a custom main class and/or a custom script instead of ignite.sh, would it be an embedded or a standalone node?
>
> -Val
>
> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov wrote:
>
>> One more note: "kill if standalone, stop if embedded" differs from what you are suggesting ("try graceful, then kill process regardless") only in the case when graceful shutdown hangs.
>>
>> Do we have an understanding of how often graceful shutdown hangs? Obviously, *grid hang* is a frequent case, but it shouldn't be confused with *graceful shutdown hang*. From my experience, if something goes wrong, users just prefer to do kill -9 because it's much more reliable and easy. Probably, in most cases when kill -9 worked, graceful stop would have worked as well - we just don't have such statistics.
>>
>> It may be a bad example, but: in our CI tests we intentionally break the grid in many harsh ways and perform a graceful stop after the test execution, and it doesn't hang - otherwise we'd see many "Execution timeout" test suite hangs.
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
>>
>>> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov wrote:
>>>
>>>> I just would like to add my +1 for the "kill if standalone, stop if embedded" default option. My arguments:
>>>>
>>>> 1) Regarding "If Ignite hangs - it will likely be impossible to stop": Unfortunately, it's true that Ignite can hang during the stop procedure. However, most of the failures described under IEP-14 (storage IO exceptions, death of a critical system worker thread, etc.) normally shouldn't turn a node into an "impossible to stop" state. Turning into that state is a bug in itself. I guess that we shouldn't choose system behavior on the basis of known bugs.
>>>
>>> The whole discussion is about protecting against force majeure issues, including Ignite bugs. You are assuming that a user application will somehow continue to function if an Ignite node is stopped. In most cases it will just freeze itself and cause the rest of the application to hang.
>>>
>>> Again, "kill+stop" is the most deterministic and the safest default behavior. Try a graceful shutdown (which will make restart easier), and then kill the process regardless.
>>>
>>> Note that we are arguing about the default behavior. If a user does not like this default, then this user can change it to another behavior.
>>>
>>>> 2) User might want to handle Ignite node crash before shutting down the whole JVM - raise alert, close external resources, etc.
>>>
>>> Very unlikely, but if a user is this advanced, then this user can change the default behavior. Most users will not even know how to configure such custom shutdown behavior and would prefer an automatic kill.
>>>
>>>> 3) The IEP-14 document has important notes: "More than one Ignite node could be started in one JVM process" and "Different nodes in one JVM process could belong to different clusters". This is possible only in embedded mode. I think we shouldn't shock the user with a sudden JVM halt (possibly along with other healthy nodes) if there's a chance of a successful node stop.
>>>
>>> Has anyone actually seen a real example of that? I have not. This scenario is extremely unlikely and should not define the default behavior. Again, if a user is so advanced as to come up with such a sophisticated deployment, then the same user should be able to set different default behaviors for different clusters.
Re: IEP-14: Ignite failures handling (Discussion)
Dmitriy,

I think you and the other participants in this discussion are talking about different cases. Maybe it would be useful to look at specific cases and discuss each of them separately?

I look at the IEP page and see the following:

```
File IO errors. Usually IOException's threw by read/write operations on file system. The following subsystems should be considered as critical:
* WAL
* Page store
* Meta store
* Binary meta store
```

Suppose we ran out of disk space on some node, and everything else is all right. Should we do `System.exit(-1);` in that case? Personally, I fully agree with Nick Podrash: "I can tell you as a user that if any library I was using in my application called System.exit without my consent, it would result in a lot of frustration."

Also, do you have any examples of other products that do `System.exit(-1);` in case of trouble?

On Tue, 13/03/2018 at 19:07 -0400, Dmitriy Setrakyan wrote:
> On Tue, Mar 13, 2018 at 6:55 PM, Dmitry Pavlov wrote:
>
> > What do you think about making stop the default for all cases?
> >
> > Kill is configurable.
> >
> > We can consider enforcing socket close for 'stop'. This will allow the rest of the cluster to ignore a hung node.
>
> Dmitriy, I see that you cannot come to terms with stopping a process that was not started by Ignite. However, in the majority of deployments, users would prefer that you "kill" the process instead of leaving it running in a "frozen" state. A frozen state is non-deterministic, and it is impossible to create a recovery procedure for it. Killing the process is very deterministic, and in most cases the node can be recovered by restarting it.
>
> "stop" does not fix the problem we are trying to solve. The whole point is to prevent the frozen state, and "stop" without "kill" does not prevent it. I am OK if "stop+kill" is the default behavior, which means that we try a graceful shutdown and then always kill the process anyway.
>
> I think we should have the following configurable options:
> - "stop+kill" (default)
> - "kill"
> - "stop"
> - "stop+restart" (if stop fails, we should kill regardless)
Re: IEP-14: Ignite failures handling (Discussion)
I can tell you as a user that if any library I was using in my application called System.exit without my consent, it would result in a lot of frustration. If Ignite enters an unrecoverable state, then I think that is something that should be observable locally, similar to node segmentation, and then the application can decide the best course of action.

Of course, if Ignite was started as a standalone process, do what you think is best, but don't think you can kill the process without backlash from users if it's running in embedded mode.

- Nick

On Tue, Mar 13, 2018, 5:12 PM Valentin Kulichenko <valentin.kuliche...@gmail.com> wrote:

> Ivan,
>
> If the grid hangs, graceful shutdown would most likely hang as well. You can almost never recover from a bad state using graceful procedures.
>
> I agree that we should not create two defaults, especially in this case. It's not even strictly defined what an embedded node is in Ignite. For example, if I start it using a custom main class and/or a custom script instead of ignite.sh, would it be an embedded or a standalone node?
>
> -Val
>
> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov wrote:
>
> > One more note: "kill if standalone, stop if embedded" differs from what you are suggesting, "try graceful, then kill process regardless", only in the case when graceful shutdown hangs.
> > Do we have an understanding of how often graceful shutdown hangs? Obviously, a *grid hang* is a frequent case, but it shouldn't be confused with a *graceful shutdown hang*. From my experience, if something went wrong, users just prefer to do kill -9 because it's much more reliable and easy. Probably, in most cases when kill -9 worked, graceful stop would have worked as well - we just don't have such statistics.
> > It may be a bad example, but: in our CI tests we intentionally break the grid in many harsh ways and perform a graceful stop after the test execution, and it doesn't hang - otherwise we'd see many "Execution timeout" test suite hangs.
> >
> > Best Regards,
> > Ivan Rakov
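Nick's suggestion above (make the failure observable locally and let the embedding application decide) could look roughly like the following. This is only a sketch under stated assumptions: `FailureKind`, `NodeFailureListener` and `FailureNotifier` are hypothetical names invented for illustration and are not an Ignite API. It keeps a single listener so that the reaction order is never ambiguous, and it falls back to a default action when the application does not handle the failure.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical failure categories (illustrative only, not an Ignite API). */
enum FailureKind { STORAGE_IO_ERROR, CRITICAL_WORKER_DIED, OUT_OF_MEMORY }

/** Hypothetical callback through which the embedding application observes a node failure. */
interface NodeFailureListener {
    /** Returns true if the application handled the failure itself, false to request the default reaction. */
    boolean onFailure(String nodeName, FailureKind kind, Throwable cause);
}

/** Holds at most one listener and runs a default action when the failure is not handled. */
final class FailureNotifier {
    private final AtomicReference<NodeFailureListener> listener = new AtomicReference<>();
    private final Runnable defaultAction;

    FailureNotifier(Runnable defaultAction) {
        this.defaultAction = Objects.requireNonNull(defaultAction);
    }

    void setListener(NodeFailureListener l) {
        listener.set(l);
    }

    void fail(String nodeName, FailureKind kind, Throwable cause) {
        NodeFailureListener l = listener.get();
        boolean handled = false;
        if (l != null) {
            try {
                handled = l.onFailure(nodeName, kind, cause);
            } catch (RuntimeException e) {
                // A broken listener must not suppress the default reaction.
                handled = false;
            }
        }
        if (!handled) {
            defaultAction.run();
        }
    }
}
```

In a standalone deployment the default action could be a process kill, while an embedded application would register a listener and return true after raising its own alert. Which default is right is exactly what the thread is debating.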
Re: IEP-14: Ignite failures handling (Discussion)
Ivan,

If the grid hangs, graceful shutdown would most likely hang as well. You can almost never recover from a bad state using graceful procedures.

I agree that we should not create two defaults, especially in this case. It's not even strictly defined what an embedded node is in Ignite. For example, if I start it using a custom main class and/or a custom script instead of ignite.sh, would it be an embedded or a standalone node?

-Val

On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov wrote:

> One more note: "kill if standalone, stop if embedded" differs from what you are suggesting, "try graceful, then kill process regardless", only in the case when graceful shutdown hangs.
> Do we have an understanding of how often graceful shutdown hangs? Obviously, a *grid hang* is a frequent case, but it shouldn't be confused with a *graceful shutdown hang*. From my experience, if something went wrong, users just prefer to do kill -9 because it's much more reliable and easy. Probably, in most cases when kill -9 worked, graceful stop would have worked as well - we just don't have such statistics.
> It may be a bad example, but: in our CI tests we intentionally break the grid in many harsh ways and perform a graceful stop after the test execution, and it doesn't hang - otherwise we'd see many "Execution timeout" test suite hangs.
>
> Best Regards,
> Ivan Rakov
Re: IEP-14: Ignite failures handling (Discussion)
One more note: "kill if standalone, stop if embedded" differs from what you are suggesting, "try graceful, then kill process regardless", only in the case when graceful shutdown hangs.

Do we have an understanding of how often graceful shutdown hangs? Obviously, a *grid hang* is a frequent case, but it shouldn't be confused with a *graceful shutdown hang*. From my experience, if something went wrong, users just prefer to do kill -9 because it's much more reliable and easy. Probably, in most cases when kill -9 worked, graceful stop would have worked as well - we just don't have such statistics.

It may be a bad example, but: in our CI tests we intentionally break the grid in many harsh ways and perform a graceful stop after the test execution, and it doesn't hang - otherwise we'd see many "Execution timeout" test suite hangs.

Best Regards,
Ivan Rakov

On 14.03.2018 2:24, Dmitriy Setrakyan wrote:

> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov wrote:
>
> > I just would like to add my +1 for the "kill if standalone, stop if embedded" default option. My arguments:
> >
> > 1) Regarding "If Ignite hangs - it will likely be impossible to stop": Unfortunately, it's true that Ignite can hang during the stop procedure. However, most of the failures described under IEP-14 (storage IO exceptions, death of a critical system worker thread, etc.) normally shouldn't turn a node into an "impossible to stop" state. Turning into that state is a bug in itself. I guess that we shouldn't choose system behavior on the basis of known bugs.
>
> The whole discussion is about protecting against force majeure issues, including Ignite bugs. You are assuming that a user application will somehow continue to function if an Ignite node is stopped. In most cases it will just freeze itself and cause the rest of the application to hang.
>
> Again, "stop+kill" is the most deterministic and the safest default behavior. Try a graceful shutdown (which will make restart easier), and then kill the process regardless.
>
> Note that we are arguing about the default behavior. If a user does not like this default, then this user can change it to another behavior.
>
> > 2) A user might want to handle an Ignite node crash before shutting down the whole JVM - raise an alert, close external resources, etc.
>
> Very unlikely, but if a user is this advanced, then this user can change the default behavior. Most users will not even know how to configure such custom shutdown behavior and would prefer an automatic kill.
>
> > 3) The IEP-14 document has important notes: "More than one Ignite node could be started in one JVM process" and "Different nodes in one JVM process could belong to different clusters". This is possible only in embedded mode. I think we shouldn't shock the user with a sudden JVM halt (possibly along with other healthy nodes) if there's a chance of a successful node stop.
>
> Has anyone actually seen a real example of that? I have not. This scenario is extremely unlikely and should not define the default behavior. Again, if a user is advanced enough to come up with such a sophisticated deployment, then the same user should be able to set different default behaviors for different clusters.
Re: IEP-14: Ignite failures handling (Discussion)
On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov wrote:

> I just would like to add my +1 for the "kill if standalone, stop if embedded" default option. My arguments:
>
> 1) Regarding "If Ignite hangs - it will likely be impossible to stop": Unfortunately, it's true that Ignite can hang during the stop procedure. However, most of the failures described under IEP-14 (storage IO exceptions, death of a critical system worker thread, etc.) normally shouldn't turn a node into an "impossible to stop" state. Turning into that state is a bug in itself. I guess that we shouldn't choose system behavior on the basis of known bugs.

The whole discussion is about protecting against force majeure issues, including Ignite bugs. You are assuming that a user application will somehow continue to function if an Ignite node is stopped. In most cases it will just freeze itself and cause the rest of the application to hang.

Again, "stop+kill" is the most deterministic and the safest default behavior. Try a graceful shutdown (which will make restart easier), and then kill the process regardless.

Note that we are arguing about the default behavior. If a user does not like this default, then this user can change it to another behavior.

> 2) A user might want to handle an Ignite node crash before shutting down the whole JVM - raise an alert, close external resources, etc.

Very unlikely, but if a user is this advanced, then this user can change the default behavior. Most users will not even know how to configure such custom shutdown behavior and would prefer an automatic kill.

> 3) The IEP-14 document has important notes: "More than one Ignite node could be started in one JVM process" and "Different nodes in one JVM process could belong to different clusters". This is possible only in embedded mode. I think we shouldn't shock the user with a sudden JVM halt (possibly along with other healthy nodes) if there's a chance of a successful node stop.

Has anyone actually seen a real example of that? I have not. This scenario is extremely unlikely and should not define the default behavior. Again, if a user is advanced enough to come up with such a sophisticated deployment, then the same user should be able to set different default behaviors for different clusters.
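The "try a graceful shutdown, then kill the process regardless" behavior described above can be sketched as follows. This is a minimal illustration, not Ignite code: the class name `StopThenKill`, the timeout and the two `Runnable` hooks are assumptions. The graceful stop runs on a daemon thread with a deadline, so a hung stop cannot block the kill step.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical "stop+kill" reaction: bounded graceful stop, then an unconditional kill. */
final class StopThenKill {
    private final Runnable gracefulStop; // e.g. a call that stops the local node
    private final Runnable kill;         // e.g. () -> Runtime.getRuntime().halt(1)
    private final long timeoutMs;

    StopThenKill(Runnable gracefulStop, Runnable kill, long timeoutMs) {
        this.gracefulStop = gracefulStop;
        this.kill = kill;
        this.timeoutMs = timeoutMs;
    }

    /** Returns true if the graceful stop completed within the timeout; the kill hook runs either way. */
    boolean handle() {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "failure-graceful-stop");
            t.setDaemon(true); // a hung stop must not keep the JVM alive
            return t;
        });
        boolean stopped = false;
        try {
            Future<?> f = exec.submit(gracefulStop);
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            stopped = true;
        } catch (TimeoutException | ExecutionException e) {
            // Graceful stop hung or failed; fall through to the kill step.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            exec.shutdownNow();
            kill.run(); // always executed, matching "kill the process regardless"
        }
        return stopped;
    }
}
```

Passing the kill step in as a `Runnable` keeps the sketch testable. In a real process it would typically be `Runtime.getRuntime().halt(...)`, in which case `handle()` never returns.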
Re: IEP-14: Ignite failures handling (Discussion)
I just would like to add my +1 for the "kill if standalone, stop if embedded" default option. My arguments:

1) Regarding "If Ignite hangs - it will likely be impossible to stop": Unfortunately, it's true that Ignite can hang during the stop procedure. However, most of the failures described under IEP-14 (storage IO exceptions, death of a critical system worker thread, etc.) normally shouldn't turn a node into an "impossible to stop" state. Turning into that state is a bug in itself. I guess that we shouldn't choose system behavior on the basis of known bugs.

2) A user might want to handle an Ignite node crash before shutting down the whole JVM - raise an alert, close external resources, etc.

3) The IEP-14 document has important notes: "More than one Ignite node could be started in one JVM process" and "Different nodes in one JVM process could belong to different clusters". This is possible only in embedded mode. I think we shouldn't shock the user with a sudden JVM halt (possibly along with other healthy nodes) if there's a chance of a successful node stop.

Best Regards,
Ivan Rakov

On 14.03.2018 1:47, Dmitriy Setrakyan wrote:

> Guys, I do not think there is an understanding here. If Ignite hangs - it will likely be impossible to stop. So if you are suggesting "stop if embedded", you might as well suggest "do nothing if embedded".
>
> I have seen many Ignite deployments, embedded or not, large and small, and in all those deployments if Ignite went into a frozen state, killing it was the best option. Moreover, it provided the most predictable behavior. I am not guessing here, but it seems to me that the rest of the community is guessing.
>
> Killing a frozen Ignite node should be the default behavior in all cases, embedded or not. Stopping a frozen Ignite node should be a configurable option, so a user has the ability to turn off auto-kill behavior. We should also have a 3rd option, "stop+kill", so if stopping fails, then the process is automatically killed (this is also a good default option).
>
> Personally, I am OK if the default behavior is "kill" or "stop+kill", but it should be the same default in all cases. We should stop the practice of creating different default behaviors for the same problem. It is confusing and hard to document.
>
> D.
Re: IEP-14: Ignite failures handling (Discussion)
On Tue, Mar 13, 2018 at 6:55 PM, Dmitry Pavlov wrote:

> What do you think about making stop the default for all cases?
>
> Kill is configurable.
>
> We can consider enforcing socket close for 'stop'. This will allow the rest of the cluster to ignore a hung node.

Dmitriy, I see that you cannot come to terms with stopping a process that was not started by Ignite. However, in the majority of deployments, users would prefer that you "kill" the process instead of leaving it running in a "frozen" state. A frozen state is non-deterministic, and it is impossible to create a recovery procedure for it. Killing the process is very deterministic, and in most cases the node can be recovered by restarting it.

"stop" does not fix the problem we are trying to solve. The whole point is to prevent the frozen state, and "stop" without "kill" does not prevent it. I am OK if "stop+kill" is the default behavior, which means that we try a graceful shutdown and then always kill the process anyway.

I think we should have the following configurable options:
- "stop+kill" (default)
- "kill"
- "stop"
- "stop+restart" (if stop fails, we should kill regardless)
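The four options listed above map naturally onto an enumeration. The sketch below is illustrative only: the type name `FailureAction`, its method names and the parsing rules are assumptions, and only the option labels and the proposed default come from the message.

```java
import java.util.Locale;

/** Hypothetical enumeration of the proposed options; labels follow the list above. */
enum FailureAction {
    STOP_KILL("stop+kill"),
    KILL("kill"),
    STOP("stop"),
    STOP_RESTART("stop+restart");

    private final String label;

    FailureAction(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }

    /** The default proposed in the message above. */
    static FailureAction defaultAction() {
        return STOP_KILL;
    }

    /** Parses a configured value; null or blank input selects the default. */
    static FailureAction parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return defaultAction();
        }
        String v = value.trim().toLowerCase(Locale.ROOT);
        for (FailureAction a : values()) {
            if (a.label.equals(v)) {
                return a;
            }
        }
        throw new IllegalArgumentException("Unknown failure action: " + value);
    }

    /** Every option except a plain "kill" attempts a graceful stop first. */
    boolean triesGracefulStop() {
        return this != KILL;
    }
}
```

Rejecting unknown values instead of silently using the default is a design choice of this sketch: a misspelled setting would otherwise change what happens to a failing node without anyone noticing.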
Re: IEP-14: Ignite failures handling (Discussion)
What do you think about making stop the default for all cases?

Kill is configurable.

We can consider enforcing socket close for 'stop'. This will allow the rest of the cluster to ignore a hung node.

Wed, Mar 14, 2018, 1:48 Dmitriy Setrakyan:

> Guys, I do not think there is an understanding here. If Ignite hangs - it will likely be impossible to stop. So if you are suggesting "stop if embedded", you might as well suggest "do nothing if embedded".
>
> I have seen many Ignite deployments, embedded or not, large and small, and in all those deployments if Ignite went into a frozen state, killing it was the best option. Moreover, it provided the most predictable behavior. I am not guessing here, but it seems to me that the rest of the community is guessing.
>
> Killing a frozen Ignite node should be the default behavior in all cases, embedded or not. Stopping a frozen Ignite node should be a configurable option, so a user has the ability to turn off auto-kill behavior. We should also have a 3rd option, "stop+kill", so if stopping fails, then the process is automatically killed (this is also a good default option).
>
> Personally, I am OK if the default behavior is "kill" or "stop+kill", but it should be the same default in all cases. We should stop the practice of creating different default behaviors for the same problem. It is confusing and hard to document.
>
> D.
Re: IEP-14: Ignite failures handling (Discussion)
Guys, I do not think there is an understanding here. If Ignite hangs - it will likely be impossible to stop. So if you are suggesting "stop if embedded", you might as well suggest "do nothing if embedded".

I have seen many Ignite deployments, embedded or not, large and small, and in all those deployments if Ignite went into a frozen state, killing it was the best option. Moreover, it provided the most predictable behavior. I am not guessing here, but it seems to me that the rest of the community is guessing.

Killing a frozen Ignite node should be the default behavior in all cases, embedded or not. Stopping a frozen Ignite node should be a configurable option, so a user has the ability to turn off auto-kill behavior. We should also have a 3rd option, "stop+kill", so if stopping fails, then the process is automatically killed (this is also a good default option).

Personally, I am OK if the default behavior is "kill" or "stop+kill", but it should be the same default in all cases. We should stop the practice of creating different default behaviors for the same problem. It is confusing and hard to document.

D.

On Tue, Mar 13, 2018 at 2:19 PM, Denis Magda wrote:

> +1 for "kill if standalone, stop if embedded" behavior. If practice shows that the node should be killed regardless of the mode, then it will be an easy change. Now we are just guessing, and common sense suggests going for "kill if standalone, stop if embedded" until we get feedback.
>
> -
> Denis
Re: IEP-14: Ignite failures handling (Discussion)
+1 for "kill if standalone, stop if embedded" behavior. If practice shows that the node should be killed regardless of the mode, then it will be an easy change. Now we are just guessing, and common sense suggests going for "kill if standalone, stop if embedded" until we get feedback.

-
Denis

On Tue, Mar 13, 2018 at 8:30 AM, Dmitry Pavlov wrote:

> You are suggesting killing a process that was not started by Ignite, aren't you?
>
> It is more consistent to stop only those processes that are launched under Ignite's control, e.g. from ignite.sh - that is OK for me.
>
> If we release 'kill by default' as part of 2.5, we will end up with an emergency 2.6 release to change it back if even one user faces such unexpected behaviour.
>
> Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan:
>
> > Dmitriy,
> >
> > I think everyone is suggesting that stopping the node will likely be impossible if Ignite is frozen. Moreover, it is very likely that all other apps are frozen too.
> >
> > My comments are below...
> >
> > On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov wrote:
> >
> > > Please consider that a user application may use Ignite as an optional cache for some low-priority feature, while the main logic functions well without Ignite. I can say, as an Ignite user in the past, that this is quite a real case.
> >
> > I have been a part of this project for a while, but I have never seen Ignite used as an optional cache. Usually, Ignite is a mandatory part of the application, not optional.
> >
> > > The second real case is using several WAR files within one application server, running different logic. Some apps use Ignite, some do not. Killing the application server in this case is not an option either.
> >
> > Not very likely, but possible. This is not a common use case. Most commonly Ignite would be serving all WAR files with a common data layer.
> >
> > > So the default should be stopping all node threads, but not killing the process. If the user is aware that the process may be killed, they may set the option.
> >
> > No, the default should be to kill the process. If the user does not like it, then it should be possible to change it to stop the node first.
> >
> > > Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan:
> > >
> > > > On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <dpavlov@gmail.com> wrote:
> > > >
> > > > > Dmitriy, the alternative is "kill if standalone, stop if embedded".
> > > > >
> > > > > The user will still be able to set something like -DNODE_CRASH_ACTION="kill" if ignite.sh is not used and the user accepts the alternative that the whole process would be killed if the node crashes.
> > > > >
> > > > > The default would be 'node stop', but without hanging indefinitely.
> > > >
> > > > Dmitriy, if Ignite is frozen, you will not be able to stop it. The only guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.
> > > >
> > > > On top of that, it is very likely that if you stop the "embedded" Ignite, the user application will not be able to function anyway, so killing the node does sound like a better and *safer* option.
> > > >
> > > > D.
Re: IEP-14: Ignite failures handling (Discussion)
You are suggesting killing a process that was not started by Ignite, aren't you?

It is more consistent to stop only those processes that are started under Ignite's control, e.g. from ignite.sh; there it is OK with me.

If we release 'kill by default' as part of 2.5, we will end up with an emergency 2.6 release to change it back as soon as one user runs into such unexpected behaviour.

Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan:

> Dmitriy,
>
> I think everyone is suggesting that stopping the node will likely be
> impossible if Ignite is frozen. Moreover, it is very likely that all other
> apps are frozen too.
>
> My comments are below...
>
> On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov wrote:
>
> > Please consider that a user application may use Ignite as an optional
> > cache for some low-priority feature, while the main logic functions well
> > without Ignite. I can say, as a past Ignite user, that this is quite a
> > real case.
> >
>
> I have been a part of this project for a while, but I have never seen
> Ignite used as an optional cache. Usually, Ignite is a mandatory part of
> the application, not optional.
>
> > The second real case is using several WAR files within one application
> > server, running different logic. Some apps use Ignite, some do not.
> > Killing the application server in this case is not an option either.
> >
>
> Not very likely, but possible. This is not a common use case. Most commonly
> Ignite would be serving all WAR files with a common data layer.
>
> > So the default should be stopping all node threads, but not killing the
> > process. If the user is aware that the process may be killed, they may
> > set the option.
> >
>
> No, the default should be to kill the process. If the user does not like it,
> then it should be possible to change it to stop the node first.
Re: IEP-14: Ignite failures handling (Discussion)
Dmitriy,

I think everyone is suggesting that stopping the node will likely be impossible if Ignite is frozen. Moreover, it is very likely that all other apps are frozen too.

My comments are below...

On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov wrote:

> Please consider that a user application may use Ignite as an optional
> cache for some low-priority feature, while the main logic functions well
> without Ignite. I can say, as a past Ignite user, that this is quite a
> real case.
>

I have been a part of this project for a while, but I have never seen Ignite used as an optional cache. Usually, Ignite is a mandatory part of the application, not optional.

> The second real case is using several WAR files within one application
> server, running different logic. Some apps use Ignite, some do not.
> Killing the application server in this case is not an option either.
>

Not very likely, but possible. This is not a common use case. Most commonly Ignite would be serving all WAR files with a common data layer.

> So the default should be stopping all node threads, but not killing the
> process. If the user is aware that the process may be killed, they may
> set the option.
>

No, the default should be to kill the process. If the user does not like it, then it should be possible to change it to stop the node first.
Re: IEP-14: Ignite failures handling (Discussion)
Please consider that a user application may use Ignite as an optional cache for some low-priority feature, while the main logic functions well without Ignite. I can say, as a past Ignite user, that this is quite a real case.

The second real case is using several WAR files within one application server, running different logic. Some apps use Ignite, some do not. Killing the application server in this case is not an option either.

So the default should be stopping all node threads, but not killing the process. If the user is aware that the process may be killed, they may set the option.

Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan:

> On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov wrote:
>
> > Dmitriy, the alternative is "kill if standalone, stop if embedded".
> >
> > The user will still be able to set something like
> > -DNODE_CRASH_ACTION="kill"
> > if ignite.sh is not used and the user accepts that the whole process
> > will be killed if the node crashes.
> >
> > The default would be 'node stop', but not hanging indefinitely.
> >
>
> Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
> guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.
>
> On top of that, it is very likely that if you stop the "embedded" Ignite,
> the user application will not be able to function anyway, so killing the
> node does sound like a better and *safer* option.
>
> D.
Re: IEP-14: Ignite failures handling (Discussion)
On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov wrote:

> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>
> The user will still be able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and the user accepts that the whole process
> will be killed if the node crashes.
>
> The default would be 'node stop', but not hanging indefinitely.
>

Dmitriy, if Ignite is frozen, you will not be able to stop it. The only guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.

On top of that, it is very likely that if you stop the "embedded" Ignite, the user application will not be able to function anyway, so killing the node does sound like a better and *safer* option.

D.
Re: IEP-14: Ignite failures handling (Discussion)
The most doubtful thing is 'stopping'. What if the node does not respond due to a critical failure?

2018-03-13 15:16 GMT+03:00 Dmitry Pavlov:

> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>
> The user will still be able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and the user accepts that the whole process
> will be killed if the node crashes.
>
> The default would be 'node stop', but not hanging indefinitely.
>
> Sincerely,
> Dmitriy Pavlov
>
> Tue, Mar 13, 2018 at 14:53, Dmitriy Setrakyan:
>

--
Best regards,
Andrey Kuznetsov.
Re: IEP-14: Ignite failures handling (Discussion)
Dmitriy, the alternative is "kill if standalone, stop if embedded".

The user will still be able to set something like
-DNODE_CRASH_ACTION="kill"
if ignite.sh is not used and the user accepts that the whole process will be killed if the node crashes.

The default would be 'node stop', but not hanging indefinitely.

Sincerely,
Dmitriy Pavlov

Tue, Mar 13, 2018 at 14:53, Dmitriy Setrakyan:

> Guys, I do not understand the alternative. If Ignite is frozen and causes
> the whole grid to freeze, how can we justify not killing it? Would users
> rather have their applications freeze?
>
> I would consider real-life use cases here. Can someone present a real-life
> example where keeping a frozen grid node around is better than killing the
> JVM?
>
> D.
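The override proposed in this message (a `-DNODE_CRASH_ACTION` system property on top of a "kill if standalone, stop if embedded" default) could be sketched roughly as follows. This is only an illustration of the idea under discussion: `NODE_CRASH_ACTION` is the name floated in the thread, not an existing Ignite property, and `CrashActionSketch`, `CrashAction` and the `standalone` flag are hypothetical names introduced here.

```java
import java.util.Locale;

/** Hypothetical sketch of the proposed default; none of these names are Ignite API. */
final class CrashActionSketch {
    /** The two reactions debated in the thread. */
    enum CrashAction { STOP_NODE, KILL_PROCESS }

    /**
     * @param configured raw value of the (hypothetical) NODE_CRASH_ACTION property, may be null.
     * @param standalone assumed to be true when the node was started by ignite.sh and owns the JVM.
     */
    static CrashAction resolve(String configured, boolean standalone) {
        if (configured != null) {
            String v = configured.trim().toLowerCase(Locale.ROOT);

            if (v.equals("kill"))
                return CrashAction.KILL_PROCESS;

            if (v.equals("stop"))
                return CrashAction.STOP_NODE;

            // Unrecognized values fall through to the mode-based default.
        }

        // "Kill if standalone, stop if embedded".
        return standalone ? CrashAction.KILL_PROCESS : CrashAction.STOP_NODE;
    }

    public static void main(String[] args) {
        String configured = System.getProperty("NODE_CRASH_ACTION");
        boolean standalone = args.length > 0 && args[0].equals("standalone");

        System.out.println(resolve(configured, standalone));
    }
}
```

An explicit setting always wins in this sketch, so an embedded user who accepts process termination only has to pass `-DNODE_CRASH_ACTION=kill`. How a node would reliably detect that it is running standalone is left open here, as it is in the thread.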
Re: IEP-14: Ignite failures handling (Discussion)
Guys, I do not understand the alternative. If Ignite is frozen and causes the whole grid to freeze, how can we justify not killing it? Would users rather have their applications freeze?

I would consider real-life use cases here. Can someone present a real-life example where keeping a frozen grid node around is better than killing the JVM?

D.

On Tue, Mar 13, 2018 at 6:16 AM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> I also like "kill if standalone, stop if embedded" by default. A user can
> change it to kill for embedded mode, but it will be a controlled, safe
> choice.
Re: IEP-14: Ignite failures handling (Discussion)
I also like "kill if standalone, stop if embedded" by default. A user can change it to kill for embedded mode, but it will be a controlled, safe choice.

2018-03-13 11:26 GMT+03:00 Vladimir Ozerov:

> +1 for "kill if standalone, stop if embedded". We should never kill the
> process for an embedded node because it might be disastrous for the user
> application.
Re: IEP-14: Ignite failures handling (Discussion)
+1 for "kill if standalone, stop if embedded". We should never kill the process for an embedded node because it might be disastrous for the user application.

On Tue, Mar 13, 2018 at 10:41 AM, Dmitry Pavlov wrote:

> Denis, Dmitriy, I am not sure I agree here. Please see a close analogue,
> the JVM itself and its ExitOnOutOfMemoryError parameter: it is not enabled
> by default.
>
> If a server node is started from the sh script, kill is OK with me, as the
> process is controlled only by Ignite. It is sufficient to add an option to
> the sh script to override the default.
>
> Users interested in this behaviour may also set this option to "kill".
>
> If a server node is started from Java, it should never kill the whole
> process. This mode is not prohibited by the docs: users are allowed to
> start several nodes in one process and to run their own application logic
> in this node.
>
> Why should we kill running user code? It could be a negative surprise for
> the user.
Re: IEP-14: Ignite failures handling (Discussion)
Denis, Dmitriy, I am not sure I agree here. Please see a close analogue, the JVM itself and its ExitOnOutOfMemoryError parameter: it is not enabled by default.

If a server node is started from the sh script, kill is OK with me, as the process is controlled only by Ignite. It is sufficient to add an option to the sh script to override the default.

Users interested in this behaviour may also set this option to "kill".

If a server node is started from Java, it should never kill the whole process. This mode is not prohibited by the docs: users are allowed to start several nodes in one process and to run their own application logic in this node.

Why should we kill running user code? It could be a negative surprise for the user.

Tue, Mar 13, 2018 at 8:26, Dmitriy Setrakyan:

> On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev wrote:
>
> > I believe the only reasonable way to handle a critical system failure (as
> > it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> > The sooner the better, with less impact. There’s simply no way to reason
> > about the state of the system in a situation like that; all bets are off.
> > Any other policy would only confuse matters and in all likelihood make
> > things worse.
> >
> > In practice, SREs/Operations would very much rather have a process die a
> > quick clean death than let it run indefinitely and hope that it’ll somehow
> > recover by itself at some point in the future, potentially degrading the
> > overall system stability and availability all the while.
> >
>
> Completely agree.
Re: IEP-14: Ignite failures handling (Discussion)
On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev wrote:

> I believe the only reasonable way to handle a critical system failure (as
> it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> The sooner the better, with less impact. There’s simply no way to reason
> about the state of the system in a situation like that; all bets are off.
> Any other policy would only confuse matters and in all likelihood make
> things worse.
>
> In practice, SREs/Operations would very much rather have a process die a
> quick clean death than let it run indefinitely and hope that it’ll somehow
> recover by itself at some point in the future, potentially degrading the
> overall system stability and availability all the while.
>

Completely agree.
Re: IEP-14: Ignite failures handling (Discussion)
On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda wrote:

> Dmitriy,
>
> An Ignite client node is usually used in embedded mode. By killing the
> whole process the node is running in, we are going to kill the entire
> application. That doesn't sound like a good plan. That's why my suggestion
> is to try to kill the node somehow rather than the whole process.
>

Agree. However, if the node cannot stop gracefully, we should kill the process anyway. This should be the default behavior. The user should be able to turn it off as needed.

> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to stop the node gracefully first. If that cannot be done, then we should kill the process.

> --
> Denis
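The "try a graceful stop first, kill the process if that fails" policy argued for in this message could look roughly like the sketch below. It is an illustration only, not Ignite code: `StopThenKillSketch`, `stopOrKill`, the timeout, and the exit status 130 are assumptions made for the example. The only real API calls are the `java.util.concurrent` classes and `Runtime.halt(int)`, which terminates the JVM without running shutdown hooks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: graceful stop with a deadline, then a hard kill. */
final class StopThenKillSketch {
    /**
     * Runs gracefulStop on a separate daemon thread and waits up to timeoutMs.
     *
     * @return true if the graceful stop finished in time; false if the kill action was run instead.
     */
    static boolean stopOrKill(Runnable gracefulStop, long timeoutMs, Runnable kill) {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true); // A hung stop attempt must not keep the JVM alive.
            return t;
        });

        Future<?> stop = exec.submit(gracefulStop);

        try {
            stop.get(timeoutMs, TimeUnit.MILLISECONDS);

            return true;
        }
        catch (Exception e) {
            // Timeout, failure or interruption: the graceful stop did not succeed.
            stop.cancel(true);
            kill.run();

            return false;
        }
        finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean stopped = stopOrKill(
            () -> System.out.println("stopping node..."), // Stand-in for a real node stop.
            1_000,
            () -> Runtime.getRuntime().halt(130));        // Hard kill, no shutdown hooks.

        System.out.println("graceful stop succeeded: " + stopped);
    }
}
```

Passing the kill action in as a `Runnable` keeps the sketch testable and lets an embedded deployment substitute something milder than `halt`. As noted elsewhere in the thread, nothing is guaranteed once the JVM is out of memory, so even this fallback is best effort.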
Re: IEP-14: Ignite failures handling (Discussion)
Dmitriy,

An Ignite client node is usually used in embedded mode. By killing the whole process the node is running in, we are going to kill the entire application. That doesn't sound like a good plan. That's why my suggestion is to try to kill the node somehow rather than the whole process.

As for the server nodes, which usually own the whole process, it's totally fine to kill the process right away.

--
Denis

On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan wrote:

> Denis, what is the difference between killing the process and killing the
> node and the process?
>
> D.
Re: IEP-14: Ignite failures handling (Discussion)
Denis, what is the difference between killing the process and killing the node and the process?

D.

On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda wrote:

> Guys,
>
> I would make a decision depending on the type of the problematic node:
>
>    - If it's a *server node*, then let's kill the process simply because
>    the node usually owns the whole process. I don't see a practical reason
>    why a user would want to run 2 server nodes in a single process.
>    - If it's a *client node*, then the best approach is to kill the node
>    and not the process.
>
> --
> Denis
Re: IEP-14: Ignite failures handling (Discussion)
Guys,

I would make a decision depending on a type of the problematic node:

- If it's a *server node*, then let's kill the process simply because the node usually owns the whole process. Don't see a practical reason why a user wants to run 2 server nodes in a single process.
- If it's a *client node*, then the best approach is to kill the node and not the process.

--
Denis

On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov wrote:

> Hi Andrey, Igniters,
>
> Thank you for starting this topic, because this is really important decision.
>
> JVM termination in case Ignite is started within application server with other application will kill all services started.
>
> So I suggest this option is not default. We can add this option (action="JVM termination") as pre-configured for ignite.sh/bat since we know is it separate JVM. But I do not vote for the option, if it was the default in code.
>
> Sincerely,
> Dmitriy Pavlov
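The node-type rule proposed in this message can be written down as a small sketch. All names here (`NodeKind`, `Reaction`, `NodeTypePolicy`) are hypothetical and are not part of the Ignite API; the block only illustrates the proposed decision, not an agreed design.

```java
/** Hypothetical: the kind of node that hit a critical failure. */
enum NodeKind { SERVER, CLIENT }

/** Hypothetical: the two reactions discussed in the message. */
enum Reaction { KILL_PROCESS, STOP_NODE }

final class NodeTypePolicy {
    private NodeTypePolicy() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * A server node usually owns the whole process, so the process is killed.
     * A client node typically shares the process with user code, so only the node is stopped.
     */
    static Reaction reactionFor(NodeKind kind) {
        return kind == NodeKind.SERVER ? Reaction.KILL_PROCESS : Reaction.STOP_NODE;
    }
}
```

A caller would evaluate `NodeTypePolicy.reactionFor(NodeKind.CLIENT)` and get `Reaction.STOP_NODE`; whether the node type should drive the default at all is exactly what the thread is debating.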
Re: IEP-14: Ignite failures handling (Discussion)
Hi Andrey, Igniters,

Thank you for starting this topic, because this is a really important decision.

If Ignite is started within an application server alongside other applications, JVM termination will kill all the services that were started.

So I suggest that this option not be the default. We can add this option (action="JVM termination") as pre-configured for ignite.sh/bat, since we know it is a separate JVM. But I would not vote for the option if it were the default in code.

Sincerely,
Dmitriy Pavlov

On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov wrote:

> To my mind, the default action should be as severe as possible, since we deal with critical errors, that is, entire JVM termination. In the case of some custom setup (e.g. different cluster nodes in one JVM) failure response action should be configured explicitly.
>
> --
> Best regards,
> Andrey Kuznetsov.
Re: IEP-14: Ignite failures handling (Discussion)
To my mind, the default action should be as severe as possible, that is, termination of the entire JVM, since we are dealing with critical errors. In the case of some custom setup (e.g. different cluster nodes in one JVM), the failure response action should be configured explicitly.

2018-03-12 12:32 GMT+03:00 Andrey Gura:

> Igniters!
>
> We are working on proposal described in IEP-14 Ignite failures handling [1] and it's time to discuss it with community (although it was necessary to do this before).
>
> Most important question: what should be default behaviour in case of failure? There are 4 actions:
>
> 1. Restart JVM process (it's possible only if process was started from ignite.(sh|bat) script)
> 2. Terminate JVM;
> 3. Stop node (if there is only one node in process then process will be also terminated);
> 4. No operation.
>
> I believe that node should be stopped by default. But there is chance that node will not stopped correctly.
>
> May be we should terminate JVM process by default. But it will kill all nodes in the JVM process. It's especially bad behaviour in case when nodes belong different Ignite clusters (real use case).
>
> May be we should restart JVM process default. This approach has the same problems as the previous one. And additionally it could lead to continues restarts and, therefore, continues exchanges and rebalancing.
>
> Difficult choice. Could you please share your thoughts.
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

--
Best regards,
Andrey Kuznetsov.
IEP-14: Ignite failures handling (Discussion)
Igniters!

We are working on the proposal described in IEP-14 Ignite failures handling [1], and it's time to discuss it with the community (although this should have been done earlier).

The most important question: what should the default behaviour be in case of failure? There are 4 actions:

1. Restart the JVM process (possible only if the process was started from the ignite.(sh|bat) script);
2. Terminate the JVM;
3. Stop the node (if there is only one node in the process, then the process will also be terminated);
4. No operation.

I believe that the node should be stopped by default. But there is a chance that the node will not be stopped correctly.

Maybe we should terminate the JVM process by default. But that will kill all nodes in the JVM process. This is especially bad behaviour when the nodes belong to different Ignite clusters (a real use case).

Maybe we should restart the JVM process by default. This approach has the same problems as the previous one. Additionally, it could lead to continuous restarts and, therefore, continuous exchanges and rebalancing.

Difficult choice. Could you please share your thoughts?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
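To make the four actions easier to compare, here is a minimal sketch of how they might be modelled. Every name in it (`FailureAction`, `FailureActionResolver`, `resolve`, `processExits`) is hypothetical and not taken from IEP-14 or the Ignite code base. The fallback from restart to terminate when the process was not started from the script is also an assumption made for illustration; the thread has not decided this.

```java
/** Hypothetical enumeration of the four actions listed in the message. */
enum FailureAction { RESTART_JVM, TERMINATE_JVM, STOP_NODE, NOOP }

final class FailureActionResolver {
    private FailureActionResolver() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Returns the action that can actually be carried out.
     * Restart is possible only when the process was started from ignite.(sh|bat);
     * otherwise this sketch falls back to plain termination (an assumption).
     */
    static FailureAction resolve(FailureAction configured, boolean startedFromScript) {
        if (configured == FailureAction.RESTART_JVM && !startedFromScript)
            return FailureAction.TERMINATE_JVM;

        return configured;
    }

    /**
     * Tells whether the current process goes away as a result of the action.
     * Stopping the only node in the process also terminates the process.
     */
    static boolean processExits(FailureAction action, int nodesInJvm) {
        switch (action) {
            case RESTART_JVM:
            case TERMINATE_JVM:
                return true;
            case STOP_NODE:
                return nodesInJvm <= 1;
            default:
                return false;
        }
    }
}
```

With two nodes in one JVM, `processExits(FailureAction.STOP_NODE, 2)` is `false` while `processExits(FailureAction.TERMINATE_JVM, 2)` is `true`, which is the trade-off between actions 2 and 3 raised in the message.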