Vadim Spector created SENTRY-1866:
-------------------------------------

             Summary: Log Sentry server failover events on Sentry clients in HA 
scenario
                 Key: SENTRY-1866
                 URL: https://issues.apache.org/jira/browse/SENTRY-1866
             Project: Sentry
          Issue Type: Improvement
            Reporter: Vadim Spector


Sentry HA-specific: when the Sentry client fails over from one Sentry server to 
another, it does not log a message saying that it has done so. Have the generic 
client print a simple, clear INFO-level message whenever the client fails over 
from one Sentry server to another.

Design considerations:

"Sentry client" stands for a specific class instance capable of connecting to a 
specific Sentry server instance from some app (usually another Hadoop service). 
In the HA scenario, the Sentry client relies on connection pooling (the 
SentryTransportPool class) to select one of several available configured Sentry 
server instances. Whenever a connection fails, the Sentry client simply asks 
SentryTransportPool to a) invalidate this specific connection and b) get 
another connection instead. There is no monitoring of Sentry server liveness 
per se. Each Sentry client finds out about a failure independently, and only at 
the time of trying to use the connection. Thus there may be no particular 
correlation between the time a connection failure is discovered and the time 
the Sentry server actually becomes unavailable. E.g. a client can discover the 
failure of an old connection long after the Sentry server crashed and was 
restarted (and maybe restarted more than once!).
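As a hypothetical sketch of the failure-handling flow described above (the 
interfaces below are stand-ins for illustration only; the real 
SentryTransportPool API differs):

```java
import java.io.IOException;

// Hypothetical stand-ins for the real Sentry classes.
interface Transport {
    void send(String request) throws IOException;
}

interface TransportPool {
    Transport borrow();             // select any available configured server
    void invalidate(Transport t);   // drop a connection that just failed
}

class RetryingClient {
    private final TransportPool pool;
    private final int maxAttempts;

    RetryingClient(TransportPool pool, int maxAttempts) {
        this.pool = pool;
        this.maxAttempts = maxAttempts;
    }

    // On failure: a) invalidate the broken connection, b) borrow another.
    // Note the client learns about a dead server only here, at the time of
    // use -- there is no background liveness monitoring.
    void call(String request) throws IOException {
        IOException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            Transport t = pool.borrow();
            try {
                t.send(request);
                return;
            } catch (IOException e) {
                last = e;
                pool.invalidate(t);
            }
        }
        throw last;
    }
}
```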

Intuitively, one would like to have a single log entry per Sentry server 
crash/shutdown; but for the reasons above, it seems difficult, if not 
impossible, to group connections by the instance(s) of the Sentry server that 
were running when those connections were initiated. It may therefore be 
challenging to say whether multiple connection failures relate to "the same" 
Sentry server instance going down, which makes it difficult to report exactly 
one connection failure per Sentry server shutdown/crash event.

Yet, the desire to have visibility into such events in the field is 
understandable. At the same time, simply logging every connection failure could 
produce massive output - there may be many concurrent connections to the Sentry 
server(s) from the same app - making such logging less than useful.

The solution will have to rely on somewhat imperfect rules to contain the 
number of connection-failure logs. An alternative solution - introducing 
periodic pinging of the Sentry server and logging only ping failures - would be 
possible as well (and it would be awesome if the Sentry server responded to 
pings with a server-id initialized to the server start timestamp; this would 
fully solve the problem), but it requires more radical changes.

The simplest solution seems to be as follows: since the recovery of a failed 
Sentry server is likely to take some time, we do not need to be too clever; it 
may be enough to report each connection failure to a given Sentry instance no 
more often than once every N (configurable) seconds. Once a connection failure 
to Sentry server instance X has been reported, another one will not be reported 
until N seconds expire. This will keep the number of connection-failure 
messages at bay. Such logs may still be confusing if a client attempts to use 
some old connection to the old server instance after an idle period.
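A minimal sketch of the proposed throttling rule, assuming illustrative names 
(this is not existing Sentry code): a failure for a given server endpoint is 
considered log-worthy only if at least N seconds have passed since the last 
logged failure for that same endpoint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-server-instance log throttle: at most one connection-failure message
// per endpoint per intervalMillis. Timestamps are passed in explicitly so
// the behavior is deterministic; a real caller would pass the current time.
class FailureLogThrottle {
    private final long intervalMillis;
    private final Map<String, Long> lastLogged = new ConcurrentHashMap<>();

    FailureLogThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if the caller should log a failure for this endpoint now. */
    boolean shouldLog(String endpoint, long nowMillis) {
        // Atomically advance the timestamp only when the interval has expired;
        // the caller that wins the update is the one that logs.
        Long updated = lastLogged.compute(endpoint, (k, prev) ->
            (prev == null || nowMillis - prev >= intervalMillis) ? nowMillis : prev);
        return updated == nowMillis;
    }
}
```

The caller would wrap its existing connection-failure log statement in 
shouldLog(endpoint, now), so the throttle stays independent of the logging 
framework in use.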

Alternative suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
