Hoss, I see several of these failures popping up, probably related to the timing of the config reload across nodes. Should we, as a phase 1, introduce a simple sleep to harden those tests, and follow up later with APIs that support waiting until the config propagates?
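Something like this as a shared test helper, perhaps (just a sketch; the class and method names below are made up and not tied to any existing Solr test API):

// Hypothetical phase-1 helper: instead of a bare Thread.sleep(), retry the
// check until it passes or a timeout expires, so the test only waits as long
// as the config reload actually takes.
import java.util.concurrent.TimeUnit;

public class SecurityWaitUtil {
  public static void retryUntilNoAssertionError(long timeoutMs, Runnable check)
      throws InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      try {
        check.run();  // e.g. a request asserting the new authn/authz rules apply
        return;       // the settings have taken effect on this node
      } catch (AssertionError e) {
        if (System.nanoTime() > deadline) throw e;  // lag exceeded our budget
        Thread.sleep(250);  // config reload lag across nodes (ZK watch firing)
      }
    }
  }
}

That way phase 1 is still just "wait it out", but the tests wait only as long as the reload takes instead of a fixed worst-case sleep.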
Jan Høydahl

> On 11 May 2019, at 01:46, Hoss Man (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837697#comment-16837697 ]
>
> Hoss Man commented on SOLR-13464:
> ---------------------------------
>
> In theory it would be possible for a test client (or any real production client) to poll {{/admin/auth...}} on all/any nodes in a cluster to verify that they are using the updated security settings, because the behavior of SecurityConfHandlerZk on GET is to read the _cached_ security props from the ZkStateReader, so in theory it's only updated once it's been force-refreshed by the ZK watcher ... but this still has 2 problems:
> # Any client doing this would have to be stateful and know what the most recent setting(s) change was, so it could assert those specific settings have been updated. There's no way for a "dumb" client to simply ask "is your current config up to date w/ ZK?". Even if the client directly polled ZK to see what the current version is in the authoritative {{/security.json}} for the cluster, the "version" info isn't included in the {{GET /admin/auth...}} responses, so it would have to do a "deep comparison" of the entire JSON response.
> # Even if the client knows what data to expect from a {{GET /admin/auth...}} request when polling all/any nodes in the cluster (either from first-hand knowledge because it was the client that did the last POST, or second-hand knowledge from querying ZK directly), and even if the expected data is returned by every node, that doesn't mean it's in *USE* yet – there is inherent lag between when the security conf data is "refreshed" in the ZkStateReader (on each node) and when the plugin Object instances are actually initialized and become active on each node.
>
> ----
> Here's a strawman proposal for a possible solution to this problem – both for use in tests and for end users that might want to verify when updated settings are really enabled...
> # Refactor CoreContainer so that methods like {{public AuthorizationPlugin getAuthorizationPlugin()}} are deprecated/syntactic sugar for new {{public SecurityPluginHolder<AuthorizationPlugin> getAuthorizationPlugin()}} methods, so that callers can read the znode version used to init the plugin.
> # Refactor {{SecurityConfHandler.getPlugin(String)}} to be deprecated/syntactic sugar for a new version that returns {{SecurityPluginHolder<?>}}.
> # Update {{SecurityConfHandlerZk.getConf}} so that it:
> ** uses {{getSecurityConfig(true)}} to ensure it reads the most current settings from ZK (instead of the cached copy used by the current code)
> ** adds the {{SecurityConfig.getVersion()}} number to the response (in addition to the config data) ... perhaps as {{key + ".conf.version"}}
> ** when {{getPlugin(key)}} is non-null, includes the {{SecurityPluginHolder.getVersion()}} in the response ... perhaps as {{key + ".enabled.version"}}
>
> ...That way a dumb client can easily poll any/all node(s) for {{/admin/auth_foo}} until {{auth_foo.conf.version}} and {{auth_foo.enabled.version}} are identical, to know when the most recent {{auth_foo}} settings in ZK's security.json are actually in use.
>
> (We could potentially take things even a step further and add something like a {{verify.cluster.version=true|false}} option to SecurityConfHandlerZk that would federate {{GET /admin/auth...}} to every (live?) node in the cluster and include a map of nodeName => enabled.version for every node ... maybe?)
>
> Thoughts?
>
>> Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
>> --------------------------------------------------------------------------------------------
>>
>>                Key: SOLR-13464
>>                URL: https://issues.apache.org/jira/browse/SOLR-13464
>>            Project: Solr
>>         Issue Type: Bug
>>     Security Level: Public (Default Security Level. Issues are Public)
>>           Reporter: Hoss Man
>>           Priority: Major
>>
>> I've been investigating some sporadic and hard-to-reproduce test failures related to authentication in cloud mode, and I *think* (but have not directly verified) that the common cause is that after one uses one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings.
>> Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet.
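PS: to make the strawman concrete for myself, a dumb client poll against one node could look roughly like the sketch below. Everything in it is an assumption on top of the proposal: the {{authentication.conf.version}} and {{authentication.enabled.version}} response keys don't exist yet, the endpoint path may differ, and the regex parsing is only there to keep the example dependency-free.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WaitForSecurityPropagation {
  // Pulls "<key>":<number> out of the JSON body; a real client would use a JSON parser.
  private static long extract(String json, String key) {
    Matcher m = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*(\\d+)").matcher(json);
    return m.find() ? Long.parseLong(m.group(1)) : -1;
  }

  // Polls GET /admin/authentication on one node until the proposed conf.version
  // and enabled.version fields match, i.e. the settings most recently written to
  // ZK's security.json are actually in use on that node.
  public static void waitForNode(String baseUrl, long timeoutMs) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest req =
        HttpRequest.newBuilder(URI.create(baseUrl + "/admin/authentication")).GET().build();
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
      long confVersion = extract(body, "authentication.conf.version");       // proposed key
      long enabledVersion = extract(body, "authentication.enabled.version"); // proposed key
      if (confVersion >= 0 && confVersion == enabledVersion) {
        return; // this node has initialized its plugins from the latest config
      }
      Thread.sleep(100); // ZK watch + plugin re-init lag
    }
    throw new IllegalStateException("security config did not propagate within " + timeoutMs + "ms");
  }
}

A test (or an end-user script) would then call waitForNode for every live node's base URL right after POSTing new settings, which is exactly the "dumb client" behavior your proposal enables.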
