Dima, please see the latest comments in the ticket [1]. There is a special specification called SQLSTATE governing what error codes are thrown from SQL operations [2]. It is applicable to both JDBC and ODBC. Apart from the standard codes, a database vendor can add its own codes as a separate field, or even extend the error codes from the standard. However, as a first iteration we should start by respecting the SQLSTATE spec, without introducing our own Ignite-specific error codes.
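For context, SQLSTATE values are five-character codes whose first two characters form a "class" defined by the SQL standard (e.g. "08" = connection exception, "42" = syntax error or access rule violation); vendors extend the standard within that scheme. A minimal sketch of what respecting that structure means — the class descriptions are from the standard, the helper itself is purely illustrative:

```java
public class SqlState {
    // SQLSTATE is five characters; the first two are the "class"
    // defined by the SQL standard.
    static String classOf(String sqlState) {
        return sqlState.substring(0, 2);
    }

    static String describeClass(String sqlState) {
        switch (classOf(sqlState)) {
            case "00": return "successful completion";
            case "08": return "connection exception";
            case "22": return "data exception";
            case "23": return "integrity constraint violation";
            case "42": return "syntax error or access rule violation";
            default:   return "vendor-specific or other standard class";
        }
    }

    public static void main(String[] args) {
        // A JDBC client reads this value via java.sql.SQLException#getSQLState().
        System.out.println(describeClass("42000"));
        System.out.println(describeClass("08001"));
    }
}
```

On the JDBC side nothing new is needed in the API itself: `SQLException.getSQLState()` already carries the standard code, and `getErrorCode()` is the slot for a vendor-specific code.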
[1] https://issues.apache.org/jira/browse/IGNITE-5620 [2] https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html#db2z_sqlstatevalues__code07 On Mon, Aug 28, 2017 at 3:23 PM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote: > On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov <voze...@gridgain.com> > wrote: > > > IGNITE-5620 is about error codes thrown from drivers. This is completely > > different story, as every driver has specification with it's own specific > > error codes. There is no common denominator. > > > > Vova, I am not sure I understand. I would expect that drivers should > provide the same SQL error codes as the underlying database. Perhaps, > drivers have their custom codes for the errors in the driver itself, not in > SQL. > > Can you please clarify? > > > > > > On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda <dma...@apache.org> wrote: > > > > > Vladimir, > > > > > > I would disagree. In IGNITE-5620 we’re going to introduce some constant > > > error codes and prepare a sheet that will elaborate on every error. > > That’s > > > a part of bigger endeavor when the whole platform should be covered by > > > special unique IDs for errors, warning and events. > > > > > > Now, we need to agree at least on the IDs range for SQL. > > > > > > — > > > Denis > > > > > > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov <voze...@gridgain.com> > > > wrote: > > > > > > > > Denis, > > > > > > > > IGNITE-5620 is completely different thing. Let's do not mix cluster > > > > monitoring and parser errors. > > > > > > > > ср, 16 авг. 2017 г. в 2:57, Denis Magda <dma...@apache.org>: > > > > > > > >> Alexey, > > > >> > > > >> Didn’t know that such an improvement as consistent IDs for errors > and > > > >> events can be used as an integration point with the DevOps tools. > > Thanks > > > >> for sharing your experience with us. 
> > > >> > > > >> Would you step in as a architect for this task and make out a JIRA > > > ticket > > > >> with all the required information. > > > >> > > > >> In general, we’ve already planned to do something around this > starting > > > >> with SQL: > > > >> https://issues.apache.org/jira/browse/IGNITE-5620 < > > > >> https://issues.apache.org/jira/browse/IGNITE-5620> > > > >> > > > >> It makes sense to consider your input before the work on IGNITE-5620 > > is > > > >> started. > > > >> > > > >> — > > > >> Denis > > > >> > > > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin < > > > >> alexeykukush...@yahoo.com.INVALID> wrote: > > > >>> > > > >>> Hi Alexey, > > > >>> A nice thing about delegating alerting to 3rd party enterprise > > systems > > > >> is that those systems already deal with lots of things including > > > >> distributed apps. > > > >>> What is needed from Ignite is to consistently write to log files > > (again > > > >> that means stable event IDs, proper event granularity, no > repetition, > > > >> documentation). This would be 3rd party monitoring system's > > > responsibility > > > >> to monitor log files on all nodes, filter, aggregate, process, > > visualize > > > >> and notify on events. > > > >>> How a monitoring tool would deal with an event like "node left": > > > >>> The only thing needed from Ignite is to write an entry like below > to > > > log > > > >> files on all Ignite servers. In this example 3300 identifies this > > "node > > > >> left" event and will never change in the future even if text > > description > > > >> changes: > > > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left > the > > > >> cluster > > > >>> Then we document somewhere on the web that Ignite has event 3300 > and > > it > > > >> means a node left the cluster. Maybe provide documentation how to > deal > > > with > > > >> it. 
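The log-matching approach Alexey describes above — a stable numeric event ID embedded in each log line — could be consumed by a monitoring plugin roughly as follows. The log format here is the hypothetical one from the example line (`[timestamp] [LEVEL] <eventId> <description>`), not an existing Ignite format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EventIdMatcher {
    // Hypothetical log layout from the example above:
    // [timestamp] [LEVEL] <eventId> <free-text description>
    private static final Pattern LOG_LINE =
        Pattern.compile("^\\[([^\\]]+)\\] \\[(\\w+)\\] (\\d+) (.*)$");

    // Returns the stable event ID, or -1 if the line does not match.
    static int eventIdOf(String line) {
        Matcher m = LOG_LINE.matcher(line);
        return m.matches() ? Integer.parseInt(m.group(3)) : -1;
    }

    public static void main(String[] args) {
        String line = "[2017-09-01 10:00:14] [WARN] 3300 "
            + "Node DF2345F-XCVDS4-34ETJH left the cluster";
        // A monitoring check keys on the stable numeric ID only,
        // so the free-text description can change between releases.
        System.out.println(eventIdOf(line));
    }
}
```

This is exactly why the ID must be stable: the regex above keeps working across releases even if the wording after the ID changes.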
Some examples:Oracle Web Cache events: > > > >> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/ > > > event.htm#sthref2393MS > > > >> SQL Server events: > > > >> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx > > > >>> That is all for Ignite! Everything else is handled by specific > > > >> monitoring system configured by DevOps on the customer side. > > > >>> Basing on the Ignite documentation similar to above, DevOps of a > > > company > > > >> where Ignite is going to be used will configure their monitoring > > system > > > to > > > >> understand Ignite events. Consider the "node left" event as an > > example. > > > >>> - This event is output on every node but DevOps do not want to be > > > >> notified many times. To address this, they will build an "Ignite > > model" > > > >> where there will be a parent-child dependency between components > > "Ignite > > > >> Cluster" and "Ignite Node". For example, this is how you do it in > > > Nagios: > > > >> https://assets.nagios.com/downloads/nagioscore/docs/ > > > nagioscore/4/en/dependencies.html > > > >> and this is how you do it in Microsoft SCSM: > > > >> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. > > Then > > > >> DevOps will configure "node left" monitors in SCSM (or a "checks" in > > > >> Nagios) for parent "Ignite Cluster" and child "Ignite Service" > > > components. > > > >> State change (OK -> WARNING) and notification (email, SMS, whatever) > > > will > > > >> be configured only for the "Ignite Cluster"'s "node left" monitor.- > > Now > > > >> suppose a node left. The "node left" monitor (that uses log file > > > monitoring > > > >> plugin) on "Ignite Node" will detect the event and pass it to the > > > parent. > > > >> This will trigger "Ignite Cluster" state change from OK to WARNING > and > > > send > > > >> a notification. 
No more notification will be sent unless the "Ignite > > > >> Cluster" state is reset back to OK, which happens either manually or > > on > > > >> timeout or automatically on "node joined". > > > >>> This was just FYI. We, Ignite developers, do not care about how > > > >> monitoring works - this is responsibility of customer's DevOps. Our > > > >> responsibility is consistent event logging. > > > >>> Thank you! > > > >>> > > > >>> > > > >>> Best regards, Alexey > > > >>> > > > >>> > > > >>> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov < > > > >> akuznet...@apache.org> wrote: > > > >>> > > > >>> Alexey, > > > >>> > > > >>> How you are going to deal with distributed nature of Ignite > cluster? > > > >>> And how do you propose handle nodes restart / stop? > > > >>> > > > >>> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin < > > > >>> alexeykukush...@yahoo.com.invalid> wrote: > > > >>> > > > >>>> Hi Denis, > > > >>>> Monitoring tools simply watch event logs for patterns (regex in > case > > > of > > > >>>> unstructured logs like text files). A stable (not changing in new > > > >> releases) > > > >>>> event ID identifying specific issue would be such a pattern. > > > >>>> We need to introduce such event IDs according to the principles I > > > >>>> described in my previous mail. > > > >>>> Best regards, Alexey > > > >>>> > > > >>>> > > > >>>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda < > > > >>>> dma...@apache.org> wrote: > > > >>>> > > > >>>> Hello Alexey, > > > >>>> > > > >>>> Thanks for the detailed input. > > > >>>> > > > >>>> Assuming that Ignite supported the suggested events based model, > how > > > can > > > >>>> it be integrated with mentioned tools like DynaTrace or Nagios? Is > > > this > > > >> all > > > >>>> we need? 
> > > >>>> > > > >>>> — > > > >>>> Denis > > > >>>> > > > >>>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin < > > > >> alexeykukush...@yahoo.com > > > >>>> .INVALID> wrote: > > > >>>>> > > > >>>>> Igniters, > > > >>>>> While preparing some Ignite materials for Administrators I found > > > Ignite > > > >>>> is not friendly for such a critical DevOps practice as monitoring. > > > >>>>> TL;DRI think Ignite misses structured descriptions of abnormal > > events > > > >>>> with references to event IDs in the logs not changing as new > > versions > > > >> are > > > >>>> released. > > > >>>>> MORE DETAILS > > > >>>>> I call an application “monitoring friendly” if it allows DevOps > to: > > > >>>>> 1. immediately receive a notification (email, SMS, etc.) > > > >>>>> 2. understand what a problem is without involving developers > > > >>>>> 3. provide automated recovery action. > > > >>>>> > > > >>>>> Large enterprises do not implement custom solutions. They usually > > use > > > >>>> tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in > the > > > >>>> enterprise consistently. All such tools have similar architecture > > > >> providing > > > >>>> a dashboard showing apps as “green/yellow/red”, and numerous > > > >> “connectors” > > > >>>> to look for events in text logs, ESBs, database tables, etc. > > > >>>>> > > > >>>>> For each app DevOps build a “health model” - a diagram displaying > > the > > > >>>> app’s “manageable” components and the app boundaries. A > “manageable” > > > >>>> component is something that can be started/stopped/configured in > > > >> isolation. > > > >>>> “System boundary” is a list of external apps that the monitored > app > > > >>>> interacts with. > > > >>>>> > > > >>>>> The main attribute of a manageable component is a list of > > > >> “operationally > > > >>>> significant events”. Those are the events that DevOps can do > > something > > > >>>> with. 
For example, “failed to connect to cache store” is > > significant, > > > >> while > > > >>>> “user input validation failed” is not. > > > >>>>> > > > >>>>> Events shall be as specific as possible so that DevOps do not > spend > > > >> time > > > >>>> for further analysis. For example, a “database failure” event is > not > > > >> good. > > > >>>> There should be “database connection failure”, “invalid database > > > >> schema”, > > > >>>> “database authentication failure”, etc. events. > > > >>>>> > > > >>>>> “Event” is NOT the same as exception occurred in the code. Events > > > >>>> identify specific problem from the DevOps point of view. For > > example, > > > >> even > > > >>>> if “connection to cache store failed” exception might be thrown > from > > > >>>> several places in the code, that is still the same event. On the > > other > > > >>>> side, even if a SqlServerConnectionTimeout and > > OracleConnectionTimeout > > > >>>> exceptions might be caught in the same place, those are different > > > events > > > >>>> since MS SQL Server and Oracle are usually different DevOps groups > > in > > > >> large > > > >>>> enterprises! > > > >>>>> > > > >>>>> The operationally significant event IDs must be stable: they must > > not > > > >>>> change from one release to another. This is like a contract > between > > > >>>> developers and DevOps. > > > >>>>> > > > >>>>> This should be the developer’s responsibility to publish and > > > maintain a > > > >>>> table with attributes: > > > >>>>> > > > >>>>> - Event ID > > > >>>>> - Severity: Critical (Red) - the system is not operational; > Warning > > > >>>> (Yellow) - the system is operational but health is degraded; None > - > > > >> just an > > > >>>> info. > > > >>>>> - Description: concise but enough for DevOps to act without > > > developer’s > > > >>>> help > > > >>>>> - Recovery actions: what DevOps shall do to fix the issue without > > > >>>> developer’s help. 
DevOps might create automated recovery scripts > > based > > > >> on > > > >>>> this information. > > > >>>>> > > > >>>>> For example: > > > >>>>> 10100 - Critical - Could not connect to Zookeeper to discovery > > nodes > > > - > > > >>>> 1) Open ignite configuration and find zookeeper connection string > 2) > > > >> Make > > > >>>> sure the Zookeeper is running > > > >>>>> 10200 - Warning - Ignite node left the cluster. > > > >>>>> > > > >>>>> Back to Ignite: it looks to me we do not design for operations as > > > >>>> described above. We have no event IDs: our logging is subject to > > > change > > > >> in > > > >>>> new version so that any patterns DevOps might use to detect > > > significant > > > >>>> events would stop working after upgrade. > > > >>>>> > > > >>>>> If I am not the only one how have such concerns then we might > open > > a > > > >>>> ticket to address this. > > > >>>>> > > > >>>>> > > > >>>>> Best regards, Alexey > > > >>>> > > > >>> > > > >>> > > > >>> > > > >>> -- > > > >>> Alexey Kuznetsov > > > >> > > > >> > > > > > > > > >
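The event table proposed in the thread (ID, severity, description, recovery actions) could be modeled as a simple in-code registry that rejects duplicate IDs. The IDs 10100 and 10200 and their descriptions are taken from the examples above; everything else (names, recovery text) is an illustrative sketch, not an existing Ignite API:

```java
import java.util.HashMap;
import java.util.Map;

public class OpsEvents {
    enum Severity { CRITICAL, WARNING, NONE }

    // The registry is the developer/DevOps contract: an ID, once
    // published, must never change meaning in a later release.
    record OpsEvent(int id, Severity severity, String description, String recovery) {}

    private static final Map<Integer, OpsEvent> REGISTRY = new HashMap<>();

    static void register(OpsEvent e) {
        if (REGISTRY.putIfAbsent(e.id(), e) != null)
            throw new IllegalStateException("Duplicate event ID: " + e.id());
    }

    static OpsEvent get(int id) {
        return REGISTRY.get(id);
    }

    static {
        register(new OpsEvent(10100, Severity.CRITICAL,
            "Could not connect to Zookeeper to discover nodes",
            "Check the Zookeeper connection string in the Ignite configuration; "
                + "make sure Zookeeper is running"));
        register(new OpsEvent(10200, Severity.WARNING,
            "Ignite node left the cluster",
            "Inspect the node's logs; restart the node if needed"));
    }

    public static void main(String[] args) {
        OpsEvent e = get(10200);
        System.out.println(e.id() + " - " + e.severity() + " - " + e.description());
    }
}
```

Publishing the registry contents as a documentation table would then be a matter of generating it from this single source of truth, which also makes accidental ID reuse a build-time failure.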