: Hoss, I see several of these failures popping up, probably related to 
: timing of the config reload across nodes. Should we, as a phase 1, 
: introduce a simple sleep to harden those tests, and follow up later with 
: APIs that support waiting until config propagates?

Well, I personally refuse to add any sleep calls to any tests -- but 
that's my personal opinion.  You and others may have your own opinions and 
take different actions than I would take :)

https://twitter.com/_hossman/status/974743183044128768
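
(FWIW, the usual alternative to a fixed sleep is a bounded poll: retry a
condition until it holds or a timeout expires.  A minimal generic sketch --
purely illustrative, not an existing Solr or test-framework API:

// Illustrative only: poll a condition with a bounded timeout instead of
// sleeping a fixed, hoped-for duration.
import java.util.function.BooleanSupplier;

public final class Waiter {
  public static boolean waitFor(BooleanSupplier condition,
                                long timeoutMs, long pollMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) return false;
      Thread.sleep(pollMs); // short poll interval, not a blind sleep
    }
    return true;
  }
}

...but as the comment below explains, today there's nothing reliable for
such a condition to check.)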

: 
: Jan Høydahl
: 
: > On 11 May 2019, at 01:46, Hoss Man (JIRA) <[email protected]> wrote:
: > 
: > 
: > [ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837697#comment-16837697 ]
: > 
: > Hoss Man commented on SOLR-13464:
: > ---------------------------------
: > 
: > In theory it would be possible for a test client (or any real production client) to poll {{/admin/auth...}} on all/any nodes in a cluster to verify that they are using the updated security settings, because the behavior of SecurityConfHandlerZk on GET is to read the _cached_ security props from the ZkStateReader, so in theory it's only updated once it's been force-refreshed by the zk watcher ... but this still has 2 problems:
: > # any client doing this would have to be stateful and know what the most recent setting(s) change was, so it could assert those specific settings have been updated. There's no way for a "dumb" client to simply ask "is your current config up to date w/ ZK?". Even if the client directly polled ZK to see what the current version is in the authoritative {{/security.json}} for the cluster, the "version" info isn't included in the {{GET /admin/auth...}} responses, so it would have to do a "deep comparison" of the entire JSON response (see the sketch after this list).
: > # even if the client knows what data to expect from a {{GET /admin/auth...}} request when polling all/any nodes in the cluster (either from first-hand knowledge because it was the client that did the last POST, or second-hand knowledge from querying ZK directly), and even if the expected data is returned by every node, that doesn't mean it's in *USE* yet -- there is inherent lag between when the security conf data is "refreshed" in the ZkStateReader (on each node) and when the plugin Object instances are actually initialized and become active on each node.
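: > 
: > To make the first problem concrete, here's a rough sketch (not an existing API) of what such a stateful client has to do today: deep-compare the JSON it previously POSTed against what each node reports back.  Jackson is assumed for the comparison, the URL path and the {{authentication}} response key are illustrative, and note that even when this returns true it does nothing for the second problem:
: > {code:java}
: > // Sketch only: a *stateful* client polling one node until the reported
: > // authn config deep-equals the JSON it previously POSTed.
: > import java.net.URI;
: > import java.net.http.HttpClient;
: > import java.net.http.HttpRequest;
: > import java.net.http.HttpResponse;
: > import com.fasterxml.jackson.databind.JsonNode;
: > import com.fasterxml.jackson.databind.ObjectMapper;
: > 
: > public class AuthConfigPoller {
: >   private static final ObjectMapper MAPPER = new ObjectMapper();
: > 
: >   static boolean waitForConfig(String nodeBaseUrl, JsonNode expectedAuthnConfig,
: >                                long timeoutMs) throws Exception {
: >     HttpClient http = HttpClient.newHttpClient();
: >     long deadline = System.currentTimeMillis() + timeoutMs;
: >     while (System.currentTimeMillis() < deadline) {
: >       HttpRequest req = HttpRequest.newBuilder(
: >           URI.create(nodeBaseUrl + "/admin/authentication")).GET().build();
: >       JsonNode body = MAPPER.readTree(
: >           http.send(req, HttpResponse.BodyHandlers.ofString()).body());
: >       // JsonNode.equals() is a deep comparison of the whole JSON tree
: >       if (expectedAuthnConfig.equals(body.get("authentication"))) {
: >         return true; // node has *seen* the config -- it may still not be in USE
: >       }
: >       Thread.sleep(250);
: >     }
: >     return false;
: >   }
: > }
: > {code}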
: > 
: > ----
: > Here's a strawman proposal for a possible solution to this problem -- both for use in tests, and for end users that might want to verify when updated settings are really enabled...
: > # refactor CoreContainer so that methods like {{public AuthorizationPlugin getAuthorizationPlugin()}} are deprecated/syntactic sugar for new {{public SecurityPluginHolder<AuthorizationPlugin> getAuthorizationPlugin()}} methods, so that callers can read the znode version used to init the plugin (see the sketch after this list)
: > # refactor {{SecurityConfHandler.getPlugin(String)}} to be deprecated/syntactic sugar for a new version that returns {{SecurityPluginHolder<?>}}
: > # update {{SecurityConfHandlerZk.getConf}} so that it:
: > ** uses {{getSecurityConfig(true)}} to ensure it reads the most current settings from ZK (instead of the cached copy used by the current code)
: > ** adds the {{SecurityConfig.getVersion()}} number to the response (in addition to the config data) ... perhaps as {{key + ".conf.version"}}
: > ** when {{getPlugin(key)}} is non-null, includes the {{SecurityPluginHolder.getVersion()}} in the response ... perhaps as {{key + ".enabled.version"}}
: > 
: > ...that way a dumb client can easily poll any/all node(s) for {{/admin/auth_foo}} until the {{auth_foo.conf.version}} and {{auth_foo.enabled.version}} are identical, to know when the most recent {{auth_foo}} settings in ZK's security.json are actually in use.
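: > 
: > Assuming the two response keys proposed above, that dumb-client poll could be as simple as this sketch (same imports as the earlier sketch; {{auth_foo}} remains a placeholder):
: > {code:java}
: > // Sketch: no knowledge of the config contents is needed -- only that
: > // the two proposed version fields match on the polled node.
: > static boolean waitUntilInUse(String nodeBaseUrl, String key, long timeoutMs)
: >     throws Exception {
: >   HttpClient http = HttpClient.newHttpClient();
: >   ObjectMapper mapper = new ObjectMapper();
: >   long deadline = System.currentTimeMillis() + timeoutMs;
: >   while (System.currentTimeMillis() < deadline) {
: >     HttpRequest req = HttpRequest.newBuilder(
: >         URI.create(nodeBaseUrl + "/admin/" + key)).GET().build();
: >     JsonNode body = mapper.readTree(
: >         http.send(req, HttpResponse.BodyHandlers.ofString()).body());
: >     JsonNode conf = body.get(key + ".conf.version");
: >     JsonNode enabled = body.get(key + ".enabled.version");
: >     // equal, non-null versions => this node's plugin instance was
: >     // (re)initialized from the settings currently authoritative in ZK
: >     if (conf != null && conf.equals(enabled)) {
: >       return true;
: >     }
: >     Thread.sleep(250);
: >   }
: >   return false;
: > }
: > {code}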
: > 
: > (We could potentially take things even a step further, and add something like a {{verify.cluster.version=true|false}} option to SecurityConfHandlerZk, that would federate {{GET /admin/auth...}} to every (live?) node in the cluster, and include a map of nodeName => enabled.version for every node ... maybe?)
: > 
: > Thoughts?
: > 
: >> Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
: >> -------------------------------------------------------------------------------------------
: >> 
: >>                Key: SOLR-13464
: >>                URL: https://issues.apache.org/jira/browse/SOLR-13464
: >>            Project: Solr
: >>         Issue Type: Bug
: >>     Security Level: Public(Default Security Level. Issues are Public) 
: >>           Reporter: Hoss Man
: >>           Priority: Major
: >> 
: >> I've been investigating some sporadic and hard-to-reproduce test failures related to authentication in cloud mode, and I *think* (but have not directly verified) that the common cause is that after using one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings.
: >> Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings, and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet.

-Hoss
http://www.lucidworks.com/