Re: [Discuss] Datasource v2 support for Kerberos
On 2 Oct 2018, at 04:44, tigerquoll <tigerqu...@outlook.com> wrote:

> Hi Steve,
> I think that passing a Kerberos keytab around is one of those bad ideas
> that is entirely appropriate to re-question every single time you come
> across it. It has been used already in Spark when interacting with
> Kerberos systems that do not support delegation tokens. Any such system
> will eventually stop talking to Spark once the passed Kerberos tickets
> expire and are unable to be renewed. It is one of those "best bad idea
> we have" situations that has arisen, been discussed to death, and
> finally, grudgingly, an interim-only solution settled on: passing the
> keytab to the worker to renew Kerberos tickets.

Generally, with the Spark AM pushing out tickets to the workers, I don't believe the workers get to see the keytab - do they? Gabor's illustration in the Kafka SPIP is probably the best illustration of it I've ever seen:
https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit#

> A long-time notable offender in this area is secure Kafka. Thankfully
> Kafka delegation tokens are soon to be supported in Spark, removing the
> need to pass keytabs around when interacting with Kafka.
>
> This particular thread could probably be better renamed as "Generic
> Datasource v2 support for Kerberos configuration" - I would like to
> divert from conversation on alternate architectures that could handle a
> lack of delegation tokens (it is a worthwhile conversation, but a long
> and involved one that will distract from this particular narrowly
> defined topic), and focus just on configuration information.
>
> A very quick look through various client code has identified at least
> the following configuration information that potentially could be of
> use to a datasource that uses Kerberos:
>
> * krb5ConfPath
> * kerberos debugging flags

mmm.
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

FWIW, Hadoop 2.8+ has the KDiag entry point, which can also be run inside an application - though there's always the risk that going near UGI too early can "collapse" Kerberos state too early:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/KDiag.java

If Spark needs something like that for 2.7.x too, copying & repackaging that class would be a place to start.

> * spark.security.credentials.${service}.enabled
> * JAAS config
> * ZKServerPrincipal ??
>
> It is entirely feasible that each datasource may require its own unique
> Kerberos configuration (e.g. you are pulling from an external
> datasource that has a different KDC than the YARN cluster you are
> running on).

This is a use-case I've never encountered; instead everyone relies on cross-AD trust. That's complex enough as it is.
Re: [Discuss] Datasource v2 support for Kerberos
> On 25 Sep 2018, at 07:52, tigerquoll wrote:
>
> To give some Kerberos-specific examples, the spark-submit args:
>
>   --conf spark.yarn.keytab=path_to_keytab
>   --conf spark.yarn.principal=princi...@realm.com
>
> are currently not passed through to the data sources.

I'm not sure why the data sources would need to know the Kerberos login details; certainly I wouldn't give them the keytab path (or indeed, access to it), and as for the principal, UserGroupInformation.getCurrentUser() should return that, including with support for UGI.doAs() and the ability to issue calls as different users from the same process. I'd also be reluctant to blindly pass Kerberos secrets over the network.

What does matter is that code interacting with a data source, dest, filesystem, etc. should be executing in the context of the intended caller, which UGI.getCurrentUser() should provide. What also matters is that whatever authentication information is needed to authenticate with a data source is passed to it. That's done in the spark-submit code for YARN by asking the filesystems, Hive & HBase; I don't know about ZooKeeper there.

I think what might be good here is to enumerate what datasources are expected to need from Kerberos (JIRA? Google doc), and from any forms of service tokens, then see how they could be handled in a way which fits into the existing world of Kerberos ticket & Hadoop service token creation on submission or in the job driver, and handoff to the workers which need them.

-Steve
Re: [Discuss] Datasource v2 support for Kerberos
I agree with Wenchen that we'd remove the prefix when passing to a source, so you could use the same "spark.yarn.keytab" option in both places. But I think the problem is that "spark.yarn.keytab" still needs to be set, and it clearly isn't in a shared namespace for catalog options. So I think we would still need a solution for existing options. I'm more comfortable with a whitelist for existing options that we want to maintain compatibility with.

rb

On Mon, Sep 24, 2018 at 11:52 PM tigerquoll wrote:
> To give some Kerberos-specific examples, the spark-submit args:
>
>   --conf spark.yarn.keytab=path_to_keytab
>   --conf spark.yarn.principal=princi...@realm.com
>
> are currently not passed through to the data sources.

--
Ryan Blue
Software Engineer
Netflix
Re: [Discuss] Datasource v2 support for Kerberos
To give some Kerberos-specific examples, the spark-submit args:

  --conf spark.yarn.keytab=path_to_keytab
  --conf spark.yarn.principal=princi...@realm.com

are currently not passed through to the data sources.
Re: [Discuss] Datasource v2 support for Kerberos
> All of the Kerberos options already exist in their own legacy locations
> though - changing their location could break a lot of systems.

We can define the prefix for shared options, and we can strip the prefix when passing these options to the data source. Will this work for your case?

On Tue, Sep 25, 2018 at 12:57 PM tigerquoll wrote:
> I like the shared namespace option better than the whitelisting option
> for any newly defined configuration information.
>
> All of the Kerberos options already exist in their own legacy locations
> though - changing their location could break a lot of systems.
>
> Perhaps we can use the shared namespace option for any new option and
> whitelisting for the existing options?
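The prefix-stripping idea can be sketched in a few lines over a plain options map. This is only an illustration: the `spark.sql.catalog.shared.` prefix value and the `SharedOptions` helper are hypothetical names, not part of Spark's API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: strip a shared prefix before handing options to a source, so the
 * same key (e.g. "spark.yarn.keytab") works both as a session conf and as a
 * per-source option. The prefix is illustrative, not a real Spark namespace.
 */
public class SharedOptions {
    static final String SHARED_PREFIX = "spark.sql.catalog.shared.";

    /** Collect shared options with the prefix stripped off. */
    public static Map<String, String> forSource(Map<String, String> conf) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(SHARED_PREFIX)) {
                out.put(e.getKey().substring(SHARED_PREFIX.length()), e.getValue());
            }
        }
        return out;
    }
}
```

With this, `spark.sql.catalog.shared.spark.yarn.keytab` would reach every source as plain `spark.yarn.keytab`, while unprefixed session keys are not forwarded.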
Re: [Discuss] Datasource v2 support for Kerberos
I like the shared namespace option better than the whitelisting option for any newly defined configuration information.

All of the Kerberos options already exist in their own legacy locations though - changing their location could break a lot of systems.

Perhaps we can use the shared namespace option for any new option and whitelisting for the existing options?
Re: [Discuss] Datasource v2 support for Kerberos
Dale, what do you think about the option that I suggested? I think that's different from the ones that you just listed.

Basically, the idea is to have a "shared" set of options that are passed to all sources. This would not be a whitelist; it would be a namespace that ends up passed in everywhere. That way, Kerberos options would be set in the shared space, but could be set directly if you want to override.

The problem I have with your option 1 is that it requires a whitelist, which is difficult to maintain and doesn't have obvious behavior. If a user wants to share an option, it has to be a special one. Otherwise the user has to wait until we add it to a whitelist, which is slow.

I don't think your option 2 works because that's no better than what we do today. And as you said, isolating config is a good goal.

Your option 3 is basically a whitelist, but with additional interfaces to activate the option sets to forward. I think that's a bit too intrusive and shares the problems that a whitelist has.

The option I'm proposing gets around those issues because it is obvious what is happening. Any option under the shared namespace is copied to all sources and catalogs. That doesn't require Spark to do anything to support specific sets of options and is predictable behavior for users to understand. It also allows us to maintain separation instead of passing all options.

I think this is a good option overall. What do you think?

rb

On Sun, Sep 23, 2018 at 5:21 PM tigerquoll wrote:
> I believe the current Spark config system is unfortunate in the way it
> has grown - you have no way of telling which sub-systems use which
> configuration options without direct and detailed reading of the code.
>
> Isolating config items for datasources into separate namespaces (rather
> than using a whitelist) is a nice idea - unfortunately in this case we
> are dealing with configuration items that have been exposed to
> end-users in their current form for a significant amount of time, and
> Kerberos cross-cuts not only datasources, but also things like YARN.
>
> So given that fact, the best options for a way forward I can think of
> are:
> 1. Whitelisting of specific sub-sections of the configuration space, or
> 2. Just pass in a Map[String,String] of all config values
> 3. Implement a specific interface for data sources to
>    indicate/implement Kerberos support
>
> Option (1) is pretty arbitrary, and more than likely the whitelist will
> change from version to version as additional items get added to it.
> Data sources will develop dependencies on certain configuration values
> being present in the whitelist.
>
> Option (2) would work, but continues the practice of having a vaguely
> specified grab-bag of config items as a dependency for practically all
> Spark code.
>
> I am beginning to warm to option (3): it would be a clean way of
> declaring that a data source supports Kerberos, and also a cleanly
> specified way of injecting the relevant Kerberos configuration
> information into the data source - and we will not need to change any
> user-facing configuration items either.

--
Ryan Blue
Software Engineer
Netflix
Re: [Discuss] Datasource v2 support for Kerberos
I believe the current Spark config system is unfortunate in the way it has grown - you have no way of telling which sub-systems use which configuration options without direct and detailed reading of the code.

Isolating config items for datasources into separate namespaces (rather than using a whitelist) is a nice idea - unfortunately in this case we are dealing with configuration items that have been exposed to end-users in their current form for a significant amount of time, and Kerberos cross-cuts not only datasources, but also things like YARN.

So given that fact, the best options for a way forward I can think of are:
1. Whitelisting of specific sub-sections of the configuration space, or
2. Just pass in a Map[String,String] of all config values
3. Implement a specific interface for data sources to indicate/implement Kerberos support

Option (1) is pretty arbitrary, and more than likely the whitelist will change from version to version as additional items get added to it. Data sources will develop dependencies on certain configuration values being present in the whitelist.

Option (2) would work, but continues the practice of having a vaguely specified grab-bag of config items as a dependency for practically all Spark code.

I am beginning to warm to option (3): it would be a clean way of declaring that a data source supports Kerberos, and also a cleanly specified way of injecting the relevant Kerberos configuration information into the data source - and we will not need to change any user-facing configuration items either.
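Option (3) might look something like the sketch below. The `SupportsKerberos` interface, the `KerberosConfig` holder, and `ExampleSource` are all hypothetical names illustrating the shape of the idea; none of them are actual DataSource V2 API.

```java
/**
 * Hypothetical sketch of option (3): a data source that implements a marker
 * interface declares Kerberos support, and Spark injects the relevant
 * configuration explicitly instead of via a shared config grab-bag.
 */
public class KerberosSupportSketch {
    /** Bundle of the Kerberos items discussed in this thread (illustrative). */
    public static final class KerberosConfig {
        public final String userPrincipal;
        public final String userKeytabPath;
        public final String krb5ConfPath; // may be null if the default applies

        public KerberosConfig(String principal, String keytab, String krb5Conf) {
            this.userPrincipal = principal;
            this.userKeytabPath = keytab;
            this.krb5ConfPath = krb5Conf;
        }
    }

    /** Hypothetical marker interface a source implements to opt in. */
    public interface SupportsKerberos {
        void setKerberosConfig(KerberosConfig config);
    }

    /** Toy source showing how the runtime could inject the config. */
    public static final class ExampleSource implements SupportsKerberos {
        private KerberosConfig kerberos;

        @Override
        public void setKerberosConfig(KerberosConfig config) {
            this.kerberos = config;
        }

        public String principal() {
            return kerberos == null ? null : kerberos.userPrincipal;
        }
    }
}
```

This keeps the user-facing configuration items untouched: Spark resolves them wherever they already live and hands a typed bundle to any source that opts in.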
Re: [Discuss] Datasource v2 support for Kerberos
I'm not a huge fan of special cases for configuration values like this. Is there something that we can do to pass a set of values to all sources (and catalogs for #21306)? I would prefer adding a special prefix for options that are passed to all sources, like this:

  spark.sql.catalog.shared.shared-property = value0
  spark.sql.catalog.jdbc-prod.prop = value1
  spark.datasource.source-name.prop = value2

All of the properties in the shared namespace would be passed to all catalogs and sources. What do you think?

On Sun, Sep 16, 2018 at 6:51 PM Wenchen Fan wrote:
> I'm +1 for this proposal: "Extend SessionConfigSupport to support
> passing specific white-listed configuration values"
>
> One goal of the data source v2 API is to not depend on any high-level
> APIs like SparkSession, SQLConf, etc. If users do want to access these
> high-level APIs, there is a workaround: calling `SparkSession.getActive`
> or `SQLConf.get`.
>
> In the meanwhile, I think your use case makes sense.
> `SessionConfigSupport` is created for this use case but it's not
> powerful enough yet. I think it should support multiple key-prefixes
> and a white-list.
>
> Feel free to submit a patch, and thanks for looking into it!
>
> On Sun, Sep 16, 2018 at 2:40 PM tigerquoll wrote:
>> The current V2 Datasource API provides support for querying a portion
>> of the SparkConfig namespace (spark.datasource.*) via the
>> SessionConfigSupport API. This was designed with the assumption that
>> all configuration information for v2 data sources should be separate
>> from each other.
>>
>> Unfortunately, there are some cross-cutting concerns such as
>> authentication that touch multiple data sources - this means that
>> common configuration items need to be shared amongst multiple data
>> sources. In particular, Kerberos setup can use the following
>> configuration items:
>>
>> * userPrincipal
>> * userKeytabPath
>> * krb5ConfPath
>> * kerberos debugging flags
>> * spark.security.credentials.${service}.enabled
>> * JAAS config
>> * ZKServerPrincipal ??
>>
>> So potential solutions I can think of to pass this information to
>> various data sources are:
>>
>> * Pass the entire SparkContext object to data sources (not likely)
>> * Pass the entire SparkConfig Map object to data sources
>> * Pass all required configuration via environment variables
>> * Extend SessionConfigSupport to support passing specific white-listed
>>   configuration values
>> * Add a specific data source v2 API "SupportsKerberos" so that a data
>>   source can indicate that it supports Kerberos and also provide the
>>   means to pass needed configuration info.
>> * Expand out all Kerberos configuration items to be in each data
>>   source config namespace that needs it.
>>
>> If the data source requires TLS support then we also need to support
>> passing all the configuration values under "spark.ssl.*".
>>
>> What do people think? A placeholder issue has been added at
>> SPARK-25329.

--
Ryan Blue
Software Engineer
Netflix
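The shared-namespace layout sketched above could be resolved per catalog roughly as follows. The property layout matches the example, but the `CatalogOptions.resolve` helper is an illustrative name, not Spark code; catalog-specific keys override shared ones, giving the "set in the shared space, override directly" behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: everything under spark.sql.catalog.shared.* is copied to every
 * catalog, and catalog-specific keys override the shared values.
 */
public class CatalogOptions {
    static final String SHARED = "spark.sql.catalog.shared.";

    public static Map<String, String> resolve(Map<String, String> conf, String catalog) {
        String prefix = "spark.sql.catalog." + catalog + ".";
        Map<String, String> out = new HashMap<>();
        // Shared options first...
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(SHARED)) {
                out.put(e.getKey().substring(SHARED.length()), e.getValue());
            }
        }
        // ...then catalog-specific options, which win on key collisions.
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                out.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }
}
```

So with `spark.sql.catalog.shared.prop = value0` and `spark.sql.catalog.jdbc-prod.prop = value1` set, the `jdbc-prod` catalog sees `prop = value1`, while every other catalog sees `prop = value0`.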
Re: [Discuss] Datasource v2 support for Kerberos
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing specific white-listed configuration values"

One goal of the data source v2 API is to not depend on any high-level APIs like SparkSession, SQLConf, etc. If users do want to access these high-level APIs, there is a workaround: calling `SparkSession.getActive` or `SQLConf.get`.

In the meanwhile, I think your use case makes sense. `SessionConfigSupport` is created for this use case but it's not powerful enough yet. I think it should support multiple key-prefixes and a white-list.

Feel free to submit a patch, and thanks for looking into it!

On Sun, Sep 16, 2018 at 2:40 PM tigerquoll wrote:
> The current V2 Datasource API provides support for querying a portion
> of the SparkConfig namespace (spark.datasource.*) via the
> SessionConfigSupport API. This was designed with the assumption that
> all configuration information for v2 data sources should be separate
> from each other.
>
> Unfortunately, there are some cross-cutting concerns such as
> authentication that touch multiple data sources - this means that
> common configuration items need to be shared amongst multiple data
> sources. In particular, Kerberos setup can use the following
> configuration items:
>
> * userPrincipal
> * userKeytabPath
> * krb5ConfPath
> * kerberos debugging flags
> * spark.security.credentials.${service}.enabled
> * JAAS config
> * ZKServerPrincipal ??
>
> So potential solutions I can think of to pass this information to
> various data sources are:
>
> * Pass the entire SparkContext object to data sources (not likely)
> * Pass the entire SparkConfig Map object to data sources
> * Pass all required configuration via environment variables
> * Extend SessionConfigSupport to support passing specific white-listed
>   configuration values
> * Add a specific data source v2 API "SupportsKerberos" so that a data
>   source can indicate that it supports Kerberos and also provide the
>   means to pass needed configuration info.
> * Expand out all Kerberos configuration items to be in each data source
>   config namespace that needs it.
>
> If the data source requires TLS support then we also need to support
> passing all the configuration values under "spark.ssl.*".
>
> What do people think? A placeholder issue has been added at SPARK-25329.
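What "multiple key-prefixes and a white-list" might look like as an extension of `SessionConfigSupport`: a sketch only. The real interface exposes a single key prefix; the multi-prefix and whitelist methods and the `extract` helper below are hypothetical additions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of an extended SessionConfigSupport. The existing API has a single
 * keyPrefix(); the methods here are hypothetical additions, not Spark API.
 */
public interface SessionConfigSupportSketch {
    /** Prefixes under spark.datasource.* this source wants, e.g. {"myds", "ssl"}. */
    Set<String> keyPrefixes();

    /** Exact session-conf keys (legacy locations) forwarded as-is. */
    Set<String> whitelistedKeys();

    /** Select the session conf entries this source should see. */
    default Map<String, String> extract(Map<String, String> sessionConf) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : sessionConf.entrySet()) {
            boolean prefixed = keyPrefixes().stream()
                .anyMatch(p -> e.getKey().startsWith("spark.datasource." + p + "."));
            if (prefixed || whitelistedKeys().contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

The whitelist part is what lets legacy keys such as `spark.yarn.keytab` reach a source without moving them out of their existing locations.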
[Discuss] Datasource v2 support for Kerberos
The current V2 Datasource API provides support for querying a portion of the SparkConfig namespace (spark.datasource.*) via the SessionConfigSupport API. This was designed with the assumption that all configuration information for v2 data sources should be separate from each other.

Unfortunately, there are some cross-cutting concerns such as authentication that touch multiple data sources - this means that common configuration items need to be shared amongst multiple data sources. In particular, Kerberos setup can use the following configuration items:

* userPrincipal
* userKeytabPath
* krb5ConfPath
* kerberos debugging flags
* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??

So potential solutions I can think of to pass this information to various data sources are:

* Pass the entire SparkContext object to data sources (not likely)
* Pass the entire SparkConfig Map object to data sources
* Pass all required configuration via environment variables
* Extend SessionConfigSupport to support passing specific white-listed configuration values
* Add a specific data source v2 API "SupportsKerberos" so that a data source can indicate that it supports Kerberos and also provide the means to pass needed configuration info.
* Expand out all Kerberos configuration items to be in each data source config namespace that needs it.

If the data source requires TLS support then we also need to support passing all the configuration values under "spark.ssl.*".

What do people think? A placeholder issue has been added at SPARK-25329.
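The last option above (expanding Kerberos items into each data source's own config namespace) can be illustrated with a small sketch; the `ExpandKerberos` helper and the plain key names are illustrative only, not existing configuration keys.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the last option: duplicate each Kerberos item into every data
 * source's own spark.datasource.<name>.* namespace, e.g.
 * userPrincipal -> spark.datasource.kafka.userPrincipal.
 */
public class ExpandKerberos {
    public static Map<String, String> expand(Map<String, String> kerberos,
                                             List<String> sources) {
        Map<String, String> out = new HashMap<>();
        for (String source : sources) {
            for (Map.Entry<String, String> e : kerberos.entrySet()) {
                out.put("spark.datasource." + source + "." + e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

The obvious downside, visible even in the sketch, is duplication: every Kerberos value is repeated once per data source, and they can drift out of sync.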