Dima, please see the latest comments in the ticket [1]. There is a special specification called SQLSTATE governing what error codes are thrown from SQL operations [2]. It is applicable to both JDBC and ODBC. Apart from the standard codes, a database vendor can add its own codes as a separate field, or even extend the error codes from the standard. However, as a first iteration we should start by respecting the SQLSTATE spec, without introducing our own Ignite-specific error codes.
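For context, SQLSTATE values are five-character codes whose first two characters form a "class" defined by the SQL standard (e.g. "08" = connection exception, "42" = syntax error or access rule violation); vendors extend the standard within that scheme. A minimal sketch of what respecting that structure means — the class descriptions are from the standard, the helper itself is purely illustrative:

```java
public class SqlState {
    // SQLSTATE is five characters; the first two are the "class"
    // defined by the SQL standard.
    static String classOf(String sqlState) {
        return sqlState.substring(0, 2);
    }

    static String describeClass(String sqlState) {
        switch (classOf(sqlState)) {
            case "00": return "successful completion";
            case "08": return "connection exception";
            case "22": return "data exception";
            case "23": return "integrity constraint violation";
            case "42": return "syntax error or access rule violation";
            default:   return "vendor-specific or other standard class";
        }
    }

    public static void main(String[] args) {
        // A JDBC client reads this value via java.sql.SQLException#getSQLState().
        System.out.println(describeClass("42000"));
        System.out.println(describeClass("08001"));
    }
}
```

On the JDBC side nothing new is needed in the API itself: `SQLException.getSQLState()` already carries the standard code, and `getErrorCode()` is the slot for a vendor-specific code.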
[1] https://issues.apache.org/jira/browse/IGNITE-5620 [2] https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html#db2z_sqlstatevalues__code07 On Mon, Aug 28, 2017 at 3:23 PM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote: > On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov <voze...@gridgain.com> > wrote: > > > IGNITE-5620 is about error codes thrown from drivers. This is completely > > different story, as every driver has specification with it's own specific > > error codes. There is no common denominator. > > > > Vova, I am not sure I understand. I would expect that drivers should > provide the same SQL error codes as the underlying database. Perhaps, > drivers have their custom codes for the errors in the driver itself, not in > SQL. > > Can you please clarify? > > > > > > On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dma...@apache.org> wrote: > > > > > Vladimir, > > > > > > I would disagree. In IGNITE-5620 we’re going to introduce some constant > > > error codes and prepare a sheet that will elaborate on every error. > > That’s > > > a part of bigger endeavor when the whole platform should be covered by > > > special unique IDs for errors, warning and events. > > > > > > Now, we need to agree at least on the IDs range for SQL. > > > > > > — > > > Denis > > > > > > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <voze...@gridgain.com> > > > wrote: > > > > > > > > Denis, > > > > > > > > IGNITE-5620 is completely different thing. Let's do not mix cluster > > > > monitoring and parser errors. > > > > > > > > ср, 16 авг. 2017 г. в 2:57, Denis Magda <dma...@apache.org>: > > > > > > > >> Alexey, > > > >> > > > >> Didn’t know that such an improvement as consistent IDs for errors > and > > > >> events can be used as an integration point with the DevOps tools. > > Thanks > > > >> for sharing your experience with us. 
> > > >> > > > >> Would you step in as a architect for this task and make out a JIRA > > > ticket > > > >> with all the required information. > > > >> > > > >> In general, we’ve already planned to do something around this > starting > > > >> with SQL: > > > >> https://issues.apache.org/jira/browse/IGNITE-5620 < > > > >> https://issues.apache.org/jira/browse/IGNITE-5620> > > > >> > > > >> It makes sense to consider your input before the work on IGNITE-5620 > > is > > > >> started. > > > >> > > > >> — > > > >> Denis > > > >> > > > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin < > > > >> alexeykukush...@yahoo.com.INVALID> wrote: > > > >>> > > > >>> Hi Alexey, > > > >>> A nice thing about delegating alerting to 3rd party enterprise > > systems > > > >> is that those systems already deal with lots of things including > > > >> distributed apps. > > > >>> What is needed from Ignite is to consistently write to log files > > (again > > > >> that means stable event IDs, proper event granularity, no > repetition, > > > >> documentation). This would be 3rd party monitoring system's > > > responsibility > > > >> to monitor log files on all nodes, filter, aggregate, process, > > visualize > > > >> and notify on events. > > > >>> How a monitoring tool would deal with an event like "node left": > > > >>> The only thing needed from Ignite is to write an entry like below > to > > > log > > > >> files on all Ignite servers. In this example 3300 identifies this > > "node > > > >> left" event and will never change in the future even if text > > description > > > >> changes: > > > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left > the > > > >> cluster > > > >>> Then we document somewhere on the web that Ignite has event 3300 > and > > it > > > >> means a node left the cluster. Maybe provide documentation how to > deal > > > with > > > >> it. 
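The log-matching approach Alexey describes above — a stable numeric event ID embedded in each log line — could be consumed by a monitoring plugin roughly as follows. The log format here is the hypothetical one from the example line (`[timestamp] [LEVEL] <eventId> <description>`), not an existing Ignite format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EventIdMatcher {
    // Hypothetical log layout from the example above:
    // [timestamp] [LEVEL] <eventId> <free-text description>
    private static final Pattern LOG_LINE =
        Pattern.compile("^\\[([^\\]]+)\\] \\[(\\w+)\\] (\\d+) (.*)$");

    // Returns the stable event ID, or -1 if the line does not match.
    static int eventIdOf(String line) {
        Matcher m = LOG_LINE.matcher(line);
        return m.matches() ? Integer.parseInt(m.group(3)) : -1;
    }

    public static void main(String[] args) {
        String line = "[2017-09-01 10:00:14] [WARN] 3300 "
            + "Node DF2345F-XCVDS4-34ETJH left the cluster";
        // A monitoring check keys on the stable numeric ID only,
        // so the free-text description can change between releases.
        System.out.println(eventIdOf(line));
    }
}
```

This is exactly why the ID must be stable: the regex above keeps working across releases even if the wording after the ID changes.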
Some examples:Oracle Web Cache events: > > > >> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/ > > > event.htm#sthref2393MS > > > >> SQL Server events: > > > >> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx > > > >>> That is all for Ignite! Everything else is handled by specific > > > >> monitoring system configured by DevOps on the customer side. > > > >>> Basing on the Ignite documentation similar to above, DevOps of a > > > company > > > >> where Ignite is going to be used will configure their monitoring > > system > > > to > > > >> understand Ignite events. Consider the "node left" event as an > > example. > > > >>> - This event is output on every node but DevOps do not want to be > > > >> notified many times. To address this, they will build an "Ignite > > model" > > > >> where there will be a parent-child dependency between components > > "Ignite > > > >> Cluster" and "Ignite Node". For example, this is how you do it in > > > Nagios: > > > >> https://assets.nagios.com/downloads/nagioscore/docs/ > > > nagioscore/4/en/dependencies.html > > > >> and this is how you do it in Microsoft SCSM: > > > >> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. > > Then > > > >> DevOps will configure "node left" monitors in SCSM (or a "checks" in > > > >> Nagios) for parent "Ignite Cluster" and child "Ignite Service" > > > components. > > > >> State change (OK -> WARNING) and notification (email, SMS, whatever) > > > will > > > >> be configured only for the "Ignite Cluster"'s "node left" monitor.- > > Now > > > >> suppose a node left. The "node left" monitor (that uses log file > > > monitoring > > > >> plugin) on "Ignite Node" will detect the event and pass it to the > > > parent. > > > >> This will trigger "Ignite Cluster" state change from OK to WARNING > and > > > send > > > >> a notification. 
No more notification will be sent unless the "Ignite > > > >> Cluster" state is reset back to OK, which happens either manually or > > on > > > >> timeout or automatically on "node joined". > > > >>> This was just FYI. We, Ignite developers, do not care about how > > > >> monitoring works - this is responsibility of customer's DevOps. Our > > > >> responsibility is consistent event logging. > > > >>> Thank you! > > > >>> > > > >>> > > > >>> Best regards, Alexey > > > >>> > > > >>> > > > >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov < > > > >> akuznet...@apache.org> wrote: > > > >>> > > > >>> Alexey, > > > >>> > > > >>> How you are going to deal with distributed nature of Ignite > cluster? > > > >>> And how do you propose handle nodes restart / stop? > > > >>> > > > >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin < > > > >>> alexeykukush...@yahoo.com.invalid> wrote: > > > >>> > > > >>>> Hi Denis, > > > >>>> Monitoring tools simply watch event logs for patterns (regex in > case > > > of > > > >>>> unstructured logs like text files). A stable (not changing in new > > > >> releases) > > > >>>> event ID identifying specific issue would be such a pattern. > > > >>>> We need to introduce such event IDs according to the principles I > > > >>>> described in my previous mail. > > > >>>> Best regards, Alexey > > > >>>> > > > >>>> > > > >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda < > > > >>>> dma...@apache.org> wrote: > > > >>>> > > > >>>> Hello Alexey, > > > >>>> > > > >>>> Thanks for the detailed input. > > > >>>> > > > >>>> Assuming that Ignite supported the suggested events based model, > how > > > can > > > >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is > > > this > > > >> all > > > >>>> we need? 
> > > >>>> > > > >>>> — > > > >>>> Denis > > > >>>> > > > >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin < > > > >> alexeykukush...@yahoo.com > > > >>>> .INVALID> wrote: > > > >>>>> > > > >>>>> Igniters, > > > >>>>> While preparing some Ignite materials for Administrators I found > > > Ignite > > > >>>> is not friendly for such a critical DevOps practice as monitoring. > > > >>>>> TL;DRI think Ignite misses structured descriptions of abnormal > > events > > > >>>> with references to event IDs in the logs not changing as new > > versions > > > >> are > > > >>>> released. > > > >>>>> MORE DETAILS > > > >>>>> I call an application “monitoring friendly” if it allows DevOps > to: > > > >>>>> 1. immediately receive a notification (email, SMS, etc.) > > > >>>>> 2. understand what a problem is without involving developers > > > >>>>> 3. provide automated recovery action. > > > >>>>> > > > >>>>> Large enterprises do not implement custom solutions. They usually > > use > > > >>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in > the > > > >>>> enterprise consistently. All such tools have similar architecture > > > >> providing > > > >>>> a dashboard showing apps as “green/yellow/red”, and numerous > > > >> “connectors” > > > >>>> to look for events in text logs, ESBs, database tables, etc. > > > >>>>> > > > >>>>> For each app DevOps build a “health model” - a diagram displaying > > the > > > >>>> app’s “manageable” components and the app boundaries. A > “manageable” > > > >>>> component is something that can be started/stopped/configured in > > > >> isolation. > > > >>>> “System boundary” is a list of external apps that the monitored > app > > > >>>> interacts with. > > > >>>>> > > > >>>>> The main attribute of a manageable component is a list of > > > >> “operationally > > > >>>> significant events”. Those are the events that DevOps can do > > something > > > >>>> with. 
For example, “failed to connect to cache store” is > > significant, > > > >> while > > > >>>> “user input validation failed” is not. > > > >>>>> > > > >>>>> Events shall be as specific as possible so that DevOps do not > spend > > > >> time > > > >>>> for further analysis. For example, a “database failure” event is > not > > > >> good. > > > >>>> There should be “database connection failure”, “invalid database > > > >> schema”, > > > >>>> “database authentication failure”, etc. events. > > > >>>>> > > > >>>>> “Event” is NOT the same as exception occurred in the code. Events > > > >>>> identify specific problem from the DevOps point of view. For > > example, > > > >> even > > > >>>> if “connection to cache store failed” exception might be thrown > from > > > >>>> several places in the code, that is still the same event. On the > > other > > > >>>> side, even if a SqlServerConnectionTimeout and > > OracleConnectionTimeout > > > >>>> exceptions might be caught in the same place, those are different > > > events > > > >>>> since MS SQL Server and Oracle are usually different DevOps groups > > in > > > >> large > > > >>>> enterprises! > > > >>>>> > > > >>>>> The operationally significant event IDs must be stable: they must > > not > > > >>>> change from one release to another. This is like a contract > between > > > >>>> developers and DevOps. > > > >>>>> > > > >>>>> This should be the developer’s responsibility to publish and > > > maintain a > > > >>>> table with attributes: > > > >>>>> > > > >>>>> - Event ID > > > >>>>> - Severity: Critical (Red) - the system is not operational; > Warning > > > >>>> (Yellow) - the system is operational but health is degraded; None > - > > > >> just an > > > >>>> info. > > > >>>>> - Description: concise but enough for DevOps to act without > > > developer’s > > > >>>> help > > > >>>>> - Recovery actions: what DevOps shall do to fix the issue without > > > >>>> developer’s help. 
DevOps might create automated recovery scripts > > based > > > >> on > > > >>>> this information. > > > >>>>> > > > >>>>> For example: > > > >>>>> 10100 - Critical - Could not connect to Zookeeper to discovery > > nodes > > > - > > > >>>> 1) Open ignite configuration and find zookeeper connection string > 2) > > > >> Make > > > >>>> sure the Zookeeper is running > > > >>>>> 10200 - Warning - Ignite node left the cluster. > > > >>>>> > > > >>>>> Back to Ignite: it looks to me we do not design for operations as > > > >>>> described above. We have no event IDs: our logging is subject to > > > change > > > >> in > > > >>>> new version so that any patterns DevOps might use to detect > > > significant > > > >>>> events would stop working after upgrade. > > > >>>>> > > > >>>>> If I am not the only one how have such concerns then we might > open > > a > > > >>>> ticket to address this. > > > >>>>> > > > >>>>> > > > >>>>> Best regards, Alexey > > > >>>> > > > >>> > > > >>> > > > >>> > > > >>> -- > > > >>> Alexey Kuznetsov > > > >> > > > >> > > > > > > > > >
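The event table proposed in the thread (ID, severity, description, recovery actions) could be modeled as a simple in-code registry that rejects duplicate IDs. The IDs 10100 and 10200 and their descriptions are taken from the examples above; everything else (names, recovery text) is an illustrative sketch, not an existing Ignite API:

```java
import java.util.HashMap;
import java.util.Map;

public class OpsEvents {
    enum Severity { CRITICAL, WARNING, NONE }

    // The registry is the developer/DevOps contract: an ID, once
    // published, must never change meaning in a later release.
    record OpsEvent(int id, Severity severity, String description, String recovery) {}

    private static final Map<Integer, OpsEvent> REGISTRY = new HashMap<>();

    static void register(OpsEvent e) {
        if (REGISTRY.putIfAbsent(e.id(), e) != null)
            throw new IllegalStateException("Duplicate event ID: " + e.id());
    }

    static OpsEvent get(int id) {
        return REGISTRY.get(id);
    }

    static {
        register(new OpsEvent(10100, Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            "Check the Zookeeper connection string in the Ignite configuration; "
                + "make sure Zookeeper is running"));
        register(new OpsEvent(10200, Severity.WARNING,
            "Ignite node left the cluster",
            "Inspect the node's logs; restart the node if needed"));
    }

    public static void main(String[] args) {
        OpsEvent e = get(10200);
        System.out.println(e.id() + " - " + e.severity() + " - " + e.description());
    }
}
```

Publishing the registry contents as a documentation table would then be a matter of generating it from this single source of truth, which also makes accidental ID reuse a build-time failure.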