Re: [Discuss] Datasource v2 support for Kerberos

2018-10-02 Thread Steve Loughran


On 2 Oct 2018, at 04:44, tigerquoll <tigerqu...@outlook.com> wrote:

Hi Steve,
I think that passing a Kerberos keytab around is one of those bad ideas that
is entirely appropriate to re-question every single time you come across it.
It has already been used in Spark when interacting with Kerberos systems
that do not support delegation tokens. Any such system will eventually stop
talking to Spark once the passed Kerberos tickets expire and cannot be
renewed.

It is one of those "best bad idea we have" situations that has arisen, been
discussed to death, and finally, grudgingly, had an interim-only solution
settled on: passing the keytab to the worker to renew Kerberos tickets.

The Spark AM generally pushes tickets out to the workers; I don't believe
the workers get to see the keytab itself, do they?

Gabor's diagram in the Kafka SPIP is probably the best illustration of this
I've ever seen:
https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit#


A long-time notable offender in this area is secure Kafka. Thankfully, Kafka
delegation tokens are soon to be supported in Spark, removing the need to
pass keytabs around when interacting with Kafka.

This particular thread could probably be better renamed "Generic Datasource
v2 support for Kerberos configuration". I would like to steer away from
conversation on alternate architectures that could handle a lack of
delegation tokens (a worthwhile conversation, but a long and involved one
that would distract from this narrowly defined topic) and focus just on
configuration information. A very quick look through various client code has
identified at least the following configuration information that could
potentially be of use to a datasource that uses Kerberos:

* krb5ConfPath
* kerberos debugging flags

mmm. 
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

FWIW, Hadoop 2.8+ has the KDiag entry point, which can also be run inside an
application, though there's always the risk that going near UGI too early can
"collapse" Kerberos state prematurely.

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/KDiag.java

If Spark needs something like that for Hadoop 2.7.x too, copying and
repackaging that class would be a place to start.
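
For illustration, a minimal sketch (Scala) of one way to drive KDiag from an
application without going near UGI in-process: the `hadoop` launcher can run
any main class on its classpath, so the diagnostics run out-of-process. The
launcher being on the PATH is an assumption here.

    import scala.sys.process._

    object RunKDiag {
      def main(args: Array[String]): Unit = {
        // Run KDiag in a child process so its exit behaviour and any
        // Kerberos state it touches stay out of our own JVM.
        val exit = Seq("hadoop", "org.apache.hadoop.security.KDiag").!
        if (exit != 0) {
          Console.err.println(s"KDiag reported Kerberos problems (exit $exit)")
        }
      }
    }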


* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??

It is entirely feasible that each datasource may require its own unique
Kerberos configuration (e.g. you are pulling from an external datasource that
has a different KDC than the YARN cluster you are running on).

This is a use case I've never encountered; instead, everyone relies on
cross-AD trust. That's complex enough as it is.


Re: [Discuss] Datasource v2 support for Kerberos

2018-09-27 Thread Steve Loughran


> On 25 Sep 2018, at 07:52, tigerquoll  wrote:
> 
> To give some Kerberos-specific examples, the spark-submit args:
> --conf spark.yarn.keytab=path_to_keytab --conf
> spark.yarn.principal=princi...@realm.com
> 
> are currently not passed through to the data sources.
> 
> 
> 


I'm not sure why the data sources would need to know the Kerberos login
details; I certainly wouldn't give them the keytab path (or indeed, access to
it). As for the principal, UserGroupInformation.getCurrentUser() should
return that, including with support for UGI.doAs() and the ability to issue
calls as different users from the same process.
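
For illustration, a minimal sketch of that pattern using the standard Hadoop
security API; the body of run() is a placeholder, not real data source code:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    object UgiExample {
      def main(args: Array[String]): Unit = {
        // The already-logged-in user: the principal comes from here,
        // not from a keytab path handed to the data source.
        val ugi = UserGroupInformation.getCurrentUser
        println(s"running as: ${ugi.getUserName}")

        // Execute a call in that user's security context; no secrets
        // change hands.
        val result = ugi.doAs(new PrivilegedExceptionAction[String] {
          override def run(): String = {
            // placeholder: open connections / read data here
            UserGroupInformation.getCurrentUser.getUserName
          }
        })
        println(s"call ran as: $result")
      }
    }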

I'd also be reluctant to blindly pass Kerberos secrets over the network. What
does matter is that code interacting with a data source, destination,
filesystem, etc. should be executing in the context of the intended caller,
which UGI.getCurrentUser() should provide.

What also matters is that whatever authentication information is needed to
authenticate with a data source is passed to it. That's done in the
spark-submit code for YARN by asking the filesystems, Hive & HBase; I don't
know about ZooKeeper there.

I think what might be good here is to enumerate what datasources are expected
to need from Kerberos (JIRA? Google doc) and from any forms of service
tokens, then see how they could be handled in a way which fits into the
existing world of Kerberos ticket & Hadoop service token creation on
submission or in the job driver, with handoff to the workers which need them.

-Steve







Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread Ryan Blue
I agree with Wenchen that we'd remove the prefix when passing to a source,
so you could use the same "spark.yarn.keytab" option in both places. But I
think the problem is that "spark.yarn.keytab" still needs to be set, and it
clearly isn't in a shared namespace for catalog options. So I think we
would still need a solution for existing options. I'm more comfortable with
a whitelist for existing options that we want to maintain compatibility
with.

rb



On Mon, Sep 24, 2018 at 11:52 PM tigerquoll  wrote:

> To give some Kerberos-specific examples, the spark-submit args:
> --conf spark.yarn.keytab=path_to_keytab --conf
> spark.yarn.principal=princi...@realm.com
>
> are currently not passed through to the data sources.
>
>
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Datasource v2 support for Kerberos

2018-09-25 Thread tigerquoll
To give some Kerberos-specific examples, the spark-submit args:
--conf spark.yarn.keytab=path_to_keytab --conf
spark.yarn.principal=princi...@realm.com

are currently not passed through to the data sources.








Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Wenchen Fan
> All of the Kerberos options already exist in their own legacy locations
though - changing their location could break a lot of systems.

We can define the prefix for shared options, and we can strip the prefix
when passing these options to the data source. Will this work for your case?
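
For illustration, a sketch of that stripping step; the prefix name
"spark.datasource.shared." is made up for the example, not an agreed
namespace:

    object SharedOptions {
      // Hypothetical shared prefix (illustrative only).
      val SharedPrefix = "spark.datasource.shared."

      // Collect shared options with the prefix stripped, ready to be
      // merged into the option map handed to each data source.
      def sharedOptions(sessionConf: Map[String, String]): Map[String, String] =
        sessionConf.collect {
          case (key, value) if key.startsWith(SharedPrefix) =>
            key.stripPrefix(SharedPrefix) -> value
        }
    }

So Map("spark.datasource.shared.krb5ConfPath" -> "/etc/krb5.conf") would
reach every source as Map("krb5ConfPath" -> "/etc/krb5.conf").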

On Tue, Sep 25, 2018 at 12:57 PM tigerquoll  wrote:

> I like the shared namespace option better than the whitelisting option for
> any newly defined configuration information.
>
> All of the Kerberos options already exist in their own legacy locations
> though - changing their location could break a lot of systems.
>
> Perhaps we can use the shared namespace option for any new option and
> whitelisting for the existing options?
>
>
>


Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread tigerquoll
I like the shared namespace option better than the whitelisting option for
any newly defined configuration information.  

All of the Kerberos options already exist in their own legacy locations
though - changing their location could break a lot of systems.

Perhaps we can use the shared namespace option for any new option and
whitelisting for the existing options?






Re: [Discuss] Datasource v2 support for Kerberos

2018-09-24 Thread Ryan Blue
Dale, what do you think about the option that I suggested? I think that's
different from the ones that you just listed.

Basically, the idea is to have a "shared" set of options that are passed to
all sources. This would not be a whitelist, it would be a namespace that
ends up passed in everywhere. That way, kerberos options would be set in
the shared space, but could be set directly if you want to override.

The problem I have with your option 1 is that it requires a whitelist,
which is difficult to maintain and doesn't have obvious behavior. If a user
wants to share an option, it has to be a special one. Otherwise the user
has to wait until we add it to a whitelist, which is slow.

I don't think your option 2 works because that's no better than what we do
today. And as you said, isolating config is a good goal.

Your option 3 is basically a whitelist, but with additional interfaces to
activate the option sets to forward. I think that's a bit too intrusive and
shares the problems that a whitelist has.

The option I'm proposing gets around those issues because it is obvious
what is happening. Any option under the shared namespace is copied to all
sources and catalogs. That doesn't require Spark to do anything to support
specific sets of options and is predictable behavior for users to
understand. It also allows us to maintain separation instead of passing all
options. I think this is a good option overall.

What do you think?

rb

On Sun, Sep 23, 2018 at 5:21 PM tigerquoll  wrote:

> I believe the current spark config system is unfortunate in the way it has
> grown - you have no way of telling which sub-systems use which
> configuration options without direct and detailed reading of the code.
>
> Isolating config items for datasources into separate namespaces (rather
> than using a whitelist) is a nice idea - unfortunately in this case we are
> dealing with configuration items that have been exposed to end-users in
> their current form for a significant amount of time, and Kerberos
> cross-cuts not only datasources, but also things like YARN.
>
> So given that fact - the best options of a way forward I can think of are:
> 1. Whitelisting of specific sub sections of the configuration space, or
> 2. Just pass in a Map[String,String] of all config values
> 3. Implement a specific interface for data sources to indicate/implement
> Kerberos support
>
> Option (1) is pretty arbitrary, and more than likely the whitelist will
> change from version to version as additional items get added to it.  Data
> sources will develop dependencies on certain configuration values being
> present in the white list.
>
> Option (2) would work, but continues the practice of having a vaguely
> specified grab-bag of config items as a dependency for practically all
> Spark
> code.
>
> I am beginning to warm to option (3); it would be a clean way of
> declaring that a data source supports Kerberos, and also a cleanly
> specified
> way of injecting the relevant Kerberos configuration information into the
> data source - and we will not need to change any user-facing configuration
> items as well.
>
>
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Datasource v2 support for Kerberos

2018-09-23 Thread tigerquoll
I believe the current spark config system is unfortunate in the way it has
grown - you have no way of telling which sub-systems use which
configuration options without direct and detailed reading of the code.

Isolating config items for datasources into separate namespaces (rather
than using a whitelist) is a nice idea - unfortunately in this case we are
dealing with configuration items that have been exposed to end-users in
their current form for a significant amount of time, and Kerberos
cross-cuts not only datasources, but also things like YARN.

So given that fact - the best options of a way forward I can think of are:
1. Whitelisting of specific sub sections of the configuration space, or
2. Just pass in a Map[String,String] of all config values 
3. Implement a specific interface for data sources to indicate/implement
Kerberos support 

Option (1) is pretty arbitrary, and more than likely the whitelist will
change from version to version as additional items get added to it.  Data
sources will develop dependencies on certain configuration values being
present in the white list.

Option (2) would work, but continues the practice of having a vaguely
specified grab-bag of config items as a dependency for practically all Spark
code.

I am beginning to warm to option (3); it would be a clean way of
declaring that a data source supports Kerberos, and also a cleanly specified
way of injecting the relevant Kerberos configuration information into the
data source - and we will not need to change any user-facing configuration
items as well.
 







Re: [Discuss] Datasource v2 support for Kerberos

2018-09-19 Thread Ryan Blue
I’m not a huge fan of special cases for configuration values like this. Is
there something that we can do to pass a set of values to all sources (and
catalogs for #21306)?

I would prefer adding a special prefix for options that are passed to all
sources, like this:

spark.sql.catalog.shared.shared-property = value0
spark.sql.catalog.jdbc-prod.prop = value1
spark.datasource.source-name.prop = value2

All of the properties in the shared namespace would be passed to all
catalogs and sources. What do you think?
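
For illustration, a sketch of that copy-with-override behaviour; the merge
logic is illustrative, not existing Spark code, and the property names follow
the example above:

    object CatalogOptions {
      private def stripped(conf: Map[String, String], prefix: String) =
        conf.collect {
          case (key, value) if key.startsWith(prefix) =>
            key.stripPrefix(prefix) -> value
        }

      // Shared options are applied first, so per-catalog settings win
      // on any key collision.
      def optionsFor(conf: Map[String, String], catalog: String): Map[String, String] =
        stripped(conf, "spark.sql.catalog.shared.") ++
          stripped(conf, s"spark.sql.catalog.$catalog.")
    }

With the three properties above, optionsFor(conf, "jdbc-prod") yields
Map("shared-property" -> "value0", "prop" -> "value1").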

On Sun, Sep 16, 2018 at 6:51 PM Wenchen Fan  wrote:

> I'm +1 for this proposal: "Extend SessionConfigSupport to support passing
> specific white-listed configuration values"
>
> One goal of data source v2 API is to not depend on any high-level APIs
> like SparkSession, SQLConf, etc. If users do want to access these
> high-level APIs, there is a workaround: calling `SparkSession.getActive` or
> `SQLConf.get`.
>
> In the meantime, I think your use case makes sense. `SessionConfigSupport`
> was created for this use case but it's not powerful enough yet. I think it
> should support multiple key-prefixes and a whitelist.
>
> Feel free to submit a patch, and thanks for looking into it!
>
> On Sun, Sep 16, 2018 at 2:40 PM tigerquoll  wrote:
>
>> The current V2 Datasource API provides support for querying a portion of
>> the
>> SparkConfig namespace (spark.datasource.*) via the SessionConfigSupport
>> API.
>> This was designed with the assumption that all configuration information
>> for
>> v2 data sources should be separate from each other.
>>
>> Unfortunately, there are some cross-cutting concerns such as
>> authentication
>> that touch multiple data sources - this means that common configuration
>> items need to be shared amongst multiple data sources.
>> In particular, Kerberos setup can use the following configuration items:
>>
>> * userPrincipal
>> * userKeytabPath
>> * krb5ConfPath
>> * kerberos debugging flags
>> * spark.security.credentials.${service}.enabled
>> * JAAS config
>> * ZKServerPrincipal ??
>>
>> So potential solutions I can think of to pass this information to various
>> data sources are:
>>
>> * Pass the entire SparkContext object to data sources (not likely)
>> * Pass the entire SparkConfig Map object to data sources
>> * Pass all required configuration via environment variables
>> * Extend SessionConfigSupport to support passing specific white-listed
>> configuration values
>> * Add a specific data source v2 API "SupportsKerberos" so that a data
>> source
>> can indicate that it supports Kerberos and also provide the means to pass
>> needed configuration info.
>> * Expand out all Kerberos configuration items to be in each data source
>> config namespace that needs it.
>>
>> If the data source requires TLS support, then we also need to support
>> passing all the configuration values under "spark.ssl.*".
>>
>> What do people think? A placeholder issue has been added at SPARK-25329.
>>
>>
>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread Wenchen Fan
I'm +1 for this proposal: "Extend SessionConfigSupport to support passing
specific white-listed configuration values"

One goal of data source v2 API is to not depend on any high-level APIs like
SparkSession, SQLConf, etc. If users do want to access these high-level
APIs, there is a workaround: calling `SparkSession.getActive` or
`SQLConf.get`.

In the meantime, I think your use case makes sense. `SessionConfigSupport`
was created for this use case but it's not powerful enough yet. I think it
should support multiple key-prefixes and a whitelist.

Feel free to submit a patch, and thanks for looking into it!
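
To make that concrete, a sketch of what such an extension might look like;
keyPrefix mirrors the existing SessionConfigSupport contract (session configs
under spark.datasource.$keyPrefix are forwarded to the source), while the two
default methods are only a proposal, not current Spark API:

    import java.util.{Collections, List => JList}

    trait ExtendedSessionConfigSupport {
      // Existing single-prefix contract.
      def keyPrefix(): String

      // Proposed: further prefixes to forward, e.g. a shared Kerberos
      // namespace.
      def extraKeyPrefixes(): JList[String] = Collections.emptyList()

      // Proposed: exact legacy keys to forward, e.g. "spark.yarn.keytab".
      def whitelistedKeys(): JList[String] = Collections.emptyList()
    }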

On Sun, Sep 16, 2018 at 2:40 PM tigerquoll  wrote:

> The current V2 Datasource API provides support for querying a portion of
> the
> SparkConfig namespace (spark.datasource.*) via the SessionConfigSupport
> API.
> This was designed with the assumption that all configuration information
> for
> v2 data sources should be separate from each other.
>
> Unfortunately, there are some cross-cutting concerns such as authentication
> that touch multiple data sources - this means that common configuration
> items need to be shared amongst multiple data sources.
> In particular, Kerberos setup can use the following configuration items:
>
> * userPrincipal
> * userKeytabPath
> * krb5ConfPath
> * kerberos debugging flags
> * spark.security.credentials.${service}.enabled
> * JAAS config
> * ZKServerPrincipal ??
>
> So potential solutions I can think of to pass this information to various
> data sources are:
>
> * Pass the entire SparkContext object to data sources (not likely)
> * Pass the entire SparkConfig Map object to data sources
> * Pass all required configuration via environment variables
> * Extend SessionConfigSupport to support passing specific white-listed
> configuration values
> * Add a specific data source v2 API "SupportsKerberos" so that a data
> source
> can indicate that it supports Kerberos and also provide the means to pass
> needed configuration info.
> * Expand out all Kerberos configuration items to be in each data source
> config namespace that needs it.
>
> If the data source requires TLS support, then we also need to support
> passing all the configuration values under "spark.ssl.*".
>
> What do people think? A placeholder issue has been added at SPARK-25329.
>
>
>


[Discuss] Datasource v2 support for Kerberos

2018-09-16 Thread tigerquoll
The current V2 Datasource API provides support for querying a portion of the
SparkConfig namespace (spark.datasource.*) via the SessionConfigSupport API. 
This was designed with the assumption that all configuration information for
v2 data sources should be separate from each other.

Unfortunately, there are some cross-cutting concerns such as authentication
that touch multiple data sources - this means that common configuration
items need to be shared amongst multiple data sources.
In particular, Kerberos setup can use the following configuration items:

* userPrincipal
* userKeytabPath
* krb5ConfPath
* kerberos debugging flags
* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??

So potential solutions I can think of to pass this information to various
data sources are:

* Pass the entire SparkContext object to data sources (not likely)
* Pass the entire SparkConfig Map object to data sources
* Pass all required configuration via environment variables
* Extend SessionConfigSupport to support passing specific white-listed
configuration values
* Add a specific data source v2 API "SupportsKerberos" so that a data source
can indicate that it supports Kerberos and also provide the means to pass
needed configuration info (see the sketch after this list).
* Expand out all Kerberos configuration items to be in each data source
config namespace that needs it.
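
For the "SupportsKerberos" option, a rough sketch of the shape such an API
could take; this is entirely illustrative, and none of these types exist in
Spark:

    // Bundles the configuration items listed above.
    case class KerberosConfig(
        userPrincipal: String,
        userKeytabPath: Option[String],
        krb5ConfPath: Option[String],
        debugFlags: Map[String, String],
        jaasConfig: Option[String],
        zkServerPrincipal: Option[String])

    // A data source mixes this in to declare Kerberos support; Spark
    // would assemble the config on the driver and inject it before any
    // reads or writes.
    trait SupportsKerberos {
      def setKerberosConfig(config: KerberosConfig): Unit
    }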

If the data source requires TLS support, then we also need to support passing
all the configuration values under "spark.ssl.*".

What do people think? A placeholder issue has been added at SPARK-25329.


