Re: [DISCUSSION] Maintenance Mode feature

Sergey Chugunov Mon, 31 Aug 2020 07:42:38 -0700

Hello Ivan,

Thank you for raising a good question; I hadn't thought about Maintenance
Mode from that perspective.


In short, Maintenance Mode isn't related to the Cluster State concept.
According to the javadoc of the ClusterState enum [1], it is solely
about cache operations and, to some extent, doesn't affect other components
of an Ignite node.
From an API perspective, putting the methods that manage Cluster State into
the IgniteCluster interface doesn't look ideal to me, but it is what it is.

On the other hand, Maintenance Mode, as I see it, will be managed through
different APIs than ClusterState, and this difference will definitely be
reflected in the documentation of the feature.

An Ignite node is a complex system of many components interacting with each
other; they may have different lifecycles and states, and the states of
different components cannot be reduced to the lowest common denominator.
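
To make the point concrete, here is a minimal Java sketch of two independent
state dimensions. It is only an illustration: ClusterStateSketch mirrors the
three cluster modes mentioned in this thread, while NodeSketch and all of its
methods are hypothetical names, not Ignite APIs.

```java
// Cluster-wide state: governs cache operations only.
enum ClusterStateSketch { INACTIVE, ACTIVE_READ_ONLY, ACTIVE }

// Hypothetical node-local view; none of these names come from Ignite.
final class NodeSketch {
    private final boolean inMaintenance;

    NodeSketch(boolean inMaintenance) {
        this.inMaintenance = inMaintenance;
    }

    // A node in maintenance stays out of the cluster regardless of cluster state.
    boolean joinsCluster() {
        return !inMaintenance;
    }

    // Cache reads need a joined node and a cluster that is not INACTIVE.
    boolean servesReads(ClusterStateSketch state) {
        return joinsCluster() && state != ClusterStateSketch.INACTIVE;
    }

    // Cache writes need a joined node and a fully ACTIVE cluster.
    boolean servesWrites(ClusterStateSketch state) {
        return joinsCluster() && state == ClusterStateSketch.ACTIVE;
    }

    // Assumption taken from the proposal: a node in maintenance still accepts commands.
    boolean acceptsMaintenanceCommands() {
        return inMaintenance;
    }
}
```

The two dimensions never collapse into one enum: the cluster state answers
"what may caches do", the maintenance flag answers "is this node part of the
cluster at all".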

However, if you have an idea for a better name for the feature, one that
lets users distinguish it more easily from other similar features, please
share it with us. Personally, I welcome any suggestions that make the design
more intuitive and easy to use.

Thanks!

[1]
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java

On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:

> Hi Sergey,
>
> Thank you for bringing attention to that important subject!
>
> My note here is about one more cluster mode. As far as I know,
> we currently have 3 modes (inactive, read-only, read-write)
> and the subject is about one more. At first glance it could be
> hard for a user to understand and use all the modes properly. Do we really
> need the whole spectrum? Could we simplify things somehow?
>
> 2020-08-27 15:59 GMT+03:00, Sergey Chugunov <sergey.chugu...@gmail.com>:
> > Hello Nikolay,
> >
> > I created one; it is available at link [1].
> >
> > Initially there was an intention to develop it under IEP-47 [2], and
> > there is even a separate section for Maintenance Mode there.
> > But it looks like this feature is useful in more cases and deserves its
> > own IEP.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> > [2]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> >
> > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <nizhi...@apache.org>
> > wrote:
> >
> >> Hello, Sergey!
> >>
> >> Thanks for the proposal.
> >> Let's have an IEP for this feature.
> >>
> >> > On 27 Aug 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com>
> >> > wrote:
> >> >
> >> > Hello Igniters,
> >> >
> >> > I want to start a discussion about a new supporting feature that could
> >> > be very useful in many scenarios where persistent storage is involved:
> >> > Maintenance Mode.
> >> >
> >> > *Summary*
> >> > Maintenance Mode (MM for short) is a special state of an Ignite node in
> >> > which the node neither serves user requests nor joins the cluster, but
> >> > waits for user commands or performs automatic actions for maintenance
> >> > purposes.
> >> >
> >> > *Motivation*
> >> > There are situations when a node cannot participate in regular
> >> > operations but at the same time should not be shut down.
> >> >
> >> > One example is ticket [1], where I developed the first draft of
> >> > Maintenance Mode.
> >> > There we get into a situation where a node has a potentially corrupted
> >> > PDS and thus cannot proceed with the restore routine and join the
> >> > cluster as usual.
> >> > At the same time, the node should neither fail nor be stopped for manual
> >> > cleanup.
> >> > Manual cleanup is not always an option (e.g. restricted access to the
> >> > file system); in managed environments a failed node will be restarted
> >> > automatically, so the user won't have time to perform the necessary
> >> > operations.
> >> > Thus the node needs to function in a special mode that allows the user
> >> > to connect to it and perform the necessary actions.
> >> >
> >> > Another example is described in IEP-47 [2], where defragmentation is
> >> > being developed. A node defragmenting its PDS should not join the
> >> > cluster until the process is finished, so it needs to enter Maintenance
> >> > Mode as well.
> >> >
> >> > *Suggested design*
> >> > I suggest that MM work as follows:
> >> > 1. A node enters MM if special markers are found on disk. These markers,
> >> > called Maintenance Records, can be created automatically (e.g. when the
> >> > storage component detects corrupted storage) or by user request (when
> >> > the user requests defragmentation of some caches). So entering MM
> >> > requires a node restart.
> >> > 2. A node started in MM doesn't join the cluster but finishes the
> >> > startup routine, so it is able to receive commands and provide metrics
> >> > to the user.
> >> > 3. When all necessary maintenance operations are finished, the
> >> > Maintenance Records for these operations are deleted from disk and the
> >> > node is restarted again to enter normal service.
> >> >
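
The three steps of the suggested design can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the code from
the PR: MaintenanceSketch, StartupMode, RECORD_FILE and the single-file record
format are all hypothetical, since the thread does not describe how
Maintenance Records are actually stored.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of marker-driven startup; not the implementation from the PR.
final class MaintenanceSketch {
    enum StartupMode { NORMAL, MAINTENANCE }

    // Assumed marker file name; the real on-disk format is not given in this thread.
    static final String RECORD_FILE = "maintenance_record.txt";

    private final Path workDir;

    MaintenanceSketch(Path workDir) {
        this.workDir = workDir;
    }

    // Step 1: a component (or a user request) leaves a record on disk.
    void registerRecord(String reason) {
        try {
            Files.write(workDir.resolve(RECORD_FILE), reason.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: on startup the node checks for the record before joining the cluster.
    StartupMode resolveStartupMode() {
        return Files.exists(workDir.resolve(RECORD_FILE))
            ? StartupMode.MAINTENANCE
            : StartupMode.NORMAL;
    }

    // Step 3: once maintenance is done the record is removed; the next start is normal.
    // Returns true if a record was actually deleted.
    boolean clearRecord() {
        try {
            return Files.deleteIfExists(workDir.resolve(RECORD_FILE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the mode is decided only at startup from what is on disk, both
entering and leaving MM require a restart, which matches step 1 and step 3.
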
> >> > *Example*
> >> > To put it into context, let's consider an example of how I see the MM
> >> > workflow in the case of PDS corruption.
> >> >
> >> >   1. A node has failed in the middle of a checkpoint while WAL is
> >> >   disabled for a particular cache -> the data files of the cache are
> >> >   potentially corrupted.
> >> >   2. On the next startup the node detects this situation, creates a
> >> >   Maintenance Record on disk and shuts down.
> >> >   3. On the next startup the node sees the Maintenance Record, enters
> >> >   Maintenance Mode and waits for the user to perform specific actions:
> >> >   clean the potentially corrupted PDS.
> >> >   4. When the user has done the necessary actions, he/she removes the
> >> >   Maintenance Record using the Maintenance Mode API exposed via the
> >> >   control.{sh|bat} script or JMX.
> >> >   5. On the next startup the node goes back to normal operation, as the
> >> >   reason for maintenance is fixed.
> >> >
> >> >
> >> > I prepared a PR [3] for ticket [1] with a draft implementation. It is
> >> > not ready to be merged into the master branch, but it is already fully
> >> > functional and can be reviewed.
> >> >
> >> > I hope you'll share your feedback on the feature and/or any thoughts on
> >> > the implementation.
> >> >
> >> > Thank you!
> >> >
> >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> >> > [2]
> >> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> >> > [3] https://github.com/apache/ignite/pull/8189
> >>
> >>
> >
>
>
> --
>
> Best regards,
> Ivan Pavlukhin
>
