Hoss, I see several of these failures popping up, probably related to the timing of the config reload across nodes. Should we, as a phase 1, introduce a simple sleep to harden those tests, and follow up later with APIs that support waiting until the config propagates?
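Something like this as a shared test helper, perhaps (just a sketch; the class and method names below are made up and not tied to any existing Solr test API):

// Hypothetical phase-1 helper: instead of a bare Thread.sleep(), retry the
// check until it passes or a timeout expires, so the test only waits as long
// as the config reload actually takes.
import java.util.concurrent.TimeUnit;

public class SecurityWaitUtil {
  public static void retryUntilNoAssertionError(long timeoutMs, Runnable check)
      throws InterruptedException {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      try {
        check.run();  // e.g. a request asserting the new authn/authz rules apply
        return;       // the settings have taken effect on this node
      } catch (AssertionError e) {
        if (System.nanoTime() > deadline) throw e;  // lag exceeded our budget
        Thread.sleep(250);  // config reload lag across nodes (ZK watch firing)
      }
    }
  }
}

That way phase 1 is still just "wait it out", but the tests wait only as long as the reload takes instead of a fixed worst-case sleep.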
Jan Høydahl

> On 11 May 2019, at 01:46, Hoss Man (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/SOLR-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837697#comment-16837697 ]
>
> Hoss Man commented on SOLR-13464:
> ---------------------------------
>
> In theory it would be possible for a test client (or any real production client) to poll {{/admin/auth...}} on all/any nodes in a cluster to verify that they are using the updated security settings, because the behavior of SecurityConfHandlerZk on GET is to read the _cached_ security props from the ZkStateReader, so in theory it's only updated once it's been force-refreshed by the ZK watcher ... but this still has 2 problems:
> # Any client doing this would have to be stateful and know what the most recent setting(s) change was, so it could assert those specific settings have been updated. There's no way for a "dumb" client to simply ask "is your current config up to date w/ ZK?". Even if the client directly polled ZK to see what the current version is in the authoritative {{/security.json}} for the cluster, the "version" info isn't included in the {{GET /admin/auth...}} responses, so it would have to do a "deep comparison" of the entire JSON response.
> # Even if the client knows what data to expect from a {{GET /admin/auth...}} request when polling all/any nodes in the cluster (either from first-hand knowledge because it was the client that did the last POST, or second-hand knowledge from querying ZK directly), and even if the expected data is returned by every node, that doesn't mean it's in *USE* yet – there is inherent lag between when the security conf data is "refreshed" in the ZkStateReader (on each node) and when the plugin Object instances are actually initialized and become active on each node.
>
> ----
> Here's a strawman proposal for a possible solution to this problem – both for use in tests and for end users that might want to verify when updated settings are really enabled...
> # Refactor CoreContainer so that methods like {{public AuthorizationPlugin getAuthorizationPlugin()}} are deprecated/syntactic sugar for new {{public SecurityPluginHolder<AuthorizationPlugin> getAuthorizationPlugin()}} methods, so that callers can read the znode version used to init the plugin.
> # Refactor {{SecurityConfHandler.getPlugin(String)}} to be deprecated/syntactic sugar for a new version that returns {{SecurityPluginHolder<?>}}.
> # Update {{SecurityConfHandlerZk.getConf}} so that it:
> ** uses {{getSecurityConfig(true)}} to ensure it reads the most current settings from ZK (instead of the cached copy used by the current code)
> ** adds the {{SecurityConfig.getVersion()}} number to the response (in addition to the config data) ... perhaps as {{key + ".conf.version"}}
> ** when {{getPlugin(key)}} is non-null, includes the {{SecurityPluginHolder.getVersion()}} in the response ... perhaps as {{key + ".enabled.version"}}
>
> ...That way a dumb client can easily poll any/all node(s) for {{/admin/auth_foo}} until {{auth_foo.conf.version}} and {{auth_foo.enabled.version}} are identical, to know when the most recent {{auth_foo}} settings in ZK's security.json are actually in use.
>
> (We could potentially take things even a step further and add something like a {{verify.cluster.version=true|false}} option to SecurityConfHandlerZk that would federate {{GET /admin/auth...}} to every (live?) node in the cluster and include a map of nodeName => enabled.version for every node ... maybe?)
>
> Thoughts?
>
>> Sporadic Auth + Cloud test failures, probably due to lag in nodes reloading security config
>> --------------------------------------------------------------------------------------------
>>
>>                Key: SOLR-13464
>>                URL: https://issues.apache.org/jira/browse/SOLR-13464
>>            Project: Solr
>>         Issue Type: Bug
>>     Security Level: Public (Default Security Level. Issues are Public)
>>           Reporter: Hoss Man
>>           Priority: Major
>>
>> I've been investigating some sporadic and hard-to-reproduce test failures related to authentication in cloud mode, and I *think* (but have not directly verified) that the common cause is that after one uses one of the {{/admin/auth...}} handlers to update some setting, there is an inherent and unpredictable delay (due to ZK watches) until every node in the cluster has had a chance to (re)load the new configuration and initialize the various security plugins with the new settings.
>> Which means, if a test client does a POST to some node to add/change/remove some authn/authz settings and then immediately hits the exact same node (or any other node) to test that the effects of those settings exist, there is no guarantee that they will have taken effect yet.
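PS: to make the strawman concrete for myself, a dumb client poll against one node could look roughly like the sketch below. Everything in it is an assumption on top of the proposal: the {{authentication.conf.version}} and {{authentication.enabled.version}} response keys don't exist yet, the endpoint path may differ, and the regex parsing is only there to keep the example dependency-free.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WaitForSecurityPropagation {
  // Pulls "<key>":<number> out of the JSON body; a real client would use a JSON parser.
  private static long extract(String json, String key) {
    Matcher m = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*(\\d+)").matcher(json);
    return m.find() ? Long.parseLong(m.group(1)) : -1;
  }

  // Polls GET /admin/authentication on one node until the proposed conf.version
  // and enabled.version fields match, i.e. the settings most recently written to
  // ZK's security.json are actually in use on that node.
  public static void waitForNode(String baseUrl, long timeoutMs) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest req =
        HttpRequest.newBuilder(URI.create(baseUrl + "/admin/authentication")).GET().build();
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
      long confVersion = extract(body, "authentication.conf.version");       // proposed key
      long enabledVersion = extract(body, "authentication.enabled.version"); // proposed key
      if (confVersion >= 0 && confVersion == enabledVersion) {
        return; // this node has initialized its plugins from the latest config
      }
      Thread.sleep(100); // ZK watch + plugin re-init lag
    }
    throw new IllegalStateException("security config did not propagate within " + timeoutMs + "ms");
  }
}

A test (or an end-user script) would then call waitForNode for every live node's base URL right after POSTing new settings, which is exactly the "dumb client" behavior your proposal enables.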
