This is an automated email from the ASF dual-hosted git repository. vanzin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 8a54492  [SPARK-25857][CORE] Add developer documentation regarding delegation tokens.

8a54492 is described below

commit 8a54492149180b57b042e3406fe4b1e53df97291
Author: Marcelo Vanzin <van...@cloudera.com>
AuthorDate: Tue Jan 15 11:23:38 2019 -0800

    [SPARK-25857][CORE] Add developer documentation regarding delegation tokens.

    Closes #23348 from vanzin/SPARK-25857.

    Authored-by: Marcelo Vanzin <van...@cloudera.com>
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>
---
 .../org/apache/spark/deploy/security/README.md | 249 +++++++++++++++++++++
 1 file changed, 249 insertions(+)

diff --git a/core/src/main/scala/org/apache/spark/deploy/security/README.md b/core/src/main/scala/org/apache/spark/deploy/security/README.md
new file mode 100644
index 0000000..c3ef60a

# Delegation Token Handling In Spark

This document aims to explain and demystify delegation tokens as they are used by Spark, since this topic is generally a huge source of confusion.

## What are delegation tokens and why use them?

Delegation tokens (DTs from now on) are authentication tokens used by some services to replace Kerberos service tokens. Many services in the Hadoop ecosystem support DTs, since they have some very desirable advantages over Kerberos tokens:

* No need to distribute Kerberos credentials

In a distributed application, distributing Kerberos credentials is tricky. Not all users have keytabs, and when they do, it's generally frowned upon to distribute them over the network as part of application data.

DTs allow a single place (e.g. the Spark driver) to require Kerberos credentials. That entity can then distribute the DTs to other parts of the distributed application (e.g. Spark executors), so they can authenticate to services.
* A single token per service is used for authentication

If Kerberos authentication were used, each client connection to a server would require a trip to the KDC and generation of a service ticket. In a distributed system, the number of service tickets can balloon pretty quickly when you think about the number of client processes (e.g. Spark executors) vs. the number of service processes (e.g. HDFS DataNodes). That generates unnecessary extra load on the KDC, and may even run into usage limits set up by the KDC admin.

* DTs are only used for authentication

DTs, unlike TGTs, can only be used to authenticate to the specific service for which they were issued. You cannot use an existing DT to create new DTs or to create DTs for a different service.

So in short, DTs are *not* Kerberos tokens. They are used by many services to replace Kerberos authentication, or even other forms of authentication, although there is nothing (aside from maybe implementation details) that ties them to Kerberos or any other authentication mechanism.

## Lifecycle of DTs

DTs, unlike Kerberos tokens, are service-specific. There is no centralized location you contact to create a DT for a service. So, the first step needed to get a DT is being able to authenticate to the service in question. In the Hadoop ecosystem, that is generally done using Kerberos.

This requires Kerberos credentials to be available somewhere for the application to use. The user is generally responsible for providing those credentials, which is most commonly done by logging in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "ticket cache" containing a TGT (ticket granting ticket), which can then be used to request service tickets.

There are other ways of obtaining TGTs, but, ultimately, you need a TGT to bootstrap the process.
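The chain described in this section can be summarized with a toy model. This is purely illustrative: none of these classes exist in Spark or Hadoop, and the names are invented for the sketch.

```python
class TGT:
    """Obtained from the KDC (e.g. via kinit); bootstraps everything else."""
    def request_delegation_token(self, service):
        # Authenticating to a service with Kerberos lets you ask it for a DT.
        return DelegationToken(service)

class DelegationToken:
    """Only good for authenticating to the service that issued it."""
    def __init__(self, service):
        self.service = service
    def authenticate(self, service):
        return service == self.service

tgt = TGT()
hdfs_dt = tgt.request_delegation_token("hdfs")
print(hdfs_dt.authenticate("hdfs"))   # True
print(hdfs_dt.authenticate("hive"))   # False: DTs are service-specific
# Note there is deliberately no DelegationToken.request_delegation_token:
# an existing DT cannot be used to mint new DTs.
```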
Once a TGT is available, the target service's client library can then be used to authenticate to the service using the Kerberos credentials, and request the creation of a delegation token. This token can now be sent to other processes and used to authenticate to different daemons belonging to that service.

And thus the first drawback of DTs becomes apparent: you need service-specific logic to create and use them.

Spark implements a (somewhat) pluggable, internal DT creation API. Support for new services can be added by implementing a `HadoopDelegationTokenProvider` that is then called by Spark when generating delegation tokens for an application. Spark makes the DTs available to code by stashing them in the `UserGroupInformation` credentials, and it's up to the DT provider and the respective client library to agree on how to use those tokens.

Once they are created, the semantics of how DTs operate are also service-specific. But, in general, they try to follow the semantics of Kerberos tokens:

* A "renewable period" (equivalent to the TGT's "lifetime"), which is how long the DT is valid before it requires renewal.
* A "max lifetime" (equivalent to the TGT's "renewable life"), which is how long the DT can be renewed.

Once the token reaches its "max lifetime", a new one needs to be created by contacting the appropriate service, restarting the above process.

## DT Renewal, Renewers, and YARN

This is the most confusing part of DT handling, partly because much of the system was designed with MapReduce, and later YARN, in mind.

As seen above, DTs need to be renewed periodically until they finally expire for good. An example of this is the default configuration of HDFS services: delegation tokens are valid for up to 7 days, and need to be renewed every 24 hours. If 24 hours pass without the token being renewed, the token cannot be used anymore. And the token cannot be renewed anymore after 7 days.
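To make that arithmetic concrete, here is a small model of those HDFS defaults (24-hour renewal interval, 7-day max lifetime). This is an illustrative sketch, not Spark or Hadoop code; only the two numbers come from the defaults mentioned above.

```python
from datetime import datetime, timedelta

RENEW_INTERVAL = timedelta(hours=24)  # token must be renewed this often
MAX_LIFETIME = timedelta(days=7)      # token cannot be renewed past this point

def token_usable(issued, last_renewed, now):
    """A token is usable only if it was renewed within the renewal interval
    and has not exceeded its maximum lifetime."""
    return (now - last_renewed) <= RENEW_INTERVAL and (now - issued) <= MAX_LIFETIME

t0 = datetime(2019, 1, 1)
# Renewed on schedule: still usable halfway through day 6.
print(token_usable(t0, t0 + timedelta(days=6), t0 + timedelta(days=6, hours=12)))  # True
# Past day 7 the max lifetime wins, no matter how recent the last renewal was.
print(token_usable(t0, t0 + timedelta(days=8), t0 + timedelta(days=8)))  # False
# Missed a renewal: 30 hours since the last one makes the token unusable.
print(token_usable(t0, t0, t0 + timedelta(hours=30)))  # False
```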
This raises the question: who renews tokens? For a long time, the answer was YARN.

When YARN applications are submitted, a set of DTs is also submitted with them. YARN takes care of distributing these tokens to containers (using conventions set by the `UserGroupInformation` API) and, also, keeping them renewed while the app is running. These tokens are used not just by the application; they are also used by YARN itself to implement features like log collection and aggregation.

But this has a few caveats.

1. Who renews the tokens?

This is handled mostly transparently by the Hadoop libraries in the case of YARN. Some services have the concept of a token "renewer": the name of the principal that is allowed to renew the DT. When submitting to YARN, that will be the principal that the YARN service is running as, which means that the client application needs to know that information.

For other resource managers, the renewer mostly does not matter, since there is no service doing the renewal. Except that it sometimes leaks into library code, such as in SPARK-20328.

2. What tokens are renewed?

This is probably the biggest caveat.

As discussed in the previous section, DTs are service-specific, and require service-specific libraries for creation *and* renewal. This means that for YARN to be able to renew application tokens, YARN needs:

* The client libraries for all the services the application is using
* Information about how to connect to the services the application is using
* Permissions to connect to those services

In reality, though, most of the time YARN has access to a single HDFS cluster, and that will be the extent of its DT renewal features. Any other tokens sent to YARN will be distributed to containers, but will not be renewed.

This means that those tokens will expire way before their max lifetime, unless some other code takes care of renewing them.
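The practical effect, using the HDFS defaults from the previous section, is that an unrenewed token's useful life collapses from its max lifetime down to a single renewal interval. A sketch of that (illustrative only, with the default numbers mentioned above):

```python
from datetime import timedelta

RENEW_INTERVAL = timedelta(hours=24)
MAX_LIFETIME = timedelta(days=7)

def effective_lifetime(renewed: bool) -> timedelta:
    # A token that nobody renews dies at the end of its first renewal
    # interval; a properly renewed one lasts until its max lifetime.
    return MAX_LIFETIME if renewed else RENEW_INTERVAL

print(effective_lifetime(True))   # prints "7 days, 0:00:00" -- e.g. an HDFS token YARN renews
print(effective_lifetime(False))  # prints "1 day, 0:00:00" -- e.g. a token YARN cannot renew
```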
Also, not all client libraries even implement token renewal. To use the example of a service supported by Spark, the `renew()` method of HBase tokens is a no-op. So the only way to "renew" an HBase token is to create a new one.

3. What happens when tokens expire for good?

The final caveat is that DTs have a maximum life, regardless of renewal. After that deadline is met, you need to create new tokens to be able to connect to the services. That means you need the ability to connect to the service without a delegation token, which requires some form of authentication aside from DTs.

This is especially important for long-running applications that run unsupervised. They must be able to keep on going without having someone log into a terminal and type a password every few days.

## DT Renewal in Spark

Because of the issues explained above, Spark implements a different way of doing renewal. Spark's solution is a compromise: it targets the lowest common denominator, which is services like HBase that do not support actual token renewal.

In Spark, DT "renewal" is enabled by giving the application a Kerberos keytab. A keytab is basically your Kerberos password written into a plain text file, which is why it's so sensitive: if anyone is able to get hold of that keytab file, they'll be able to authenticate to any service as that user, for as long as the credentials stored in the keytab remain valid in the KDC.

By having the keytab, Spark can maintain a valid Kerberos TGT indefinitely.

With Kerberos credentials available, Spark will create new DTs for the configured services as old ones expire. So Spark doesn't renew tokens as explained in the previous section: it will instead create new tokens at every renewal interval, and distribute those tokens to executors.

This also has another advantage on top of supporting services like HBase: it removes the dependency on an external renewal service (like YARN).
That way, Spark's renewal feature can be used with resource managers that are not DT-aware, such as Mesos or Kubernetes, as long as the application has access to a keytab.

## DTs and Proxy Users

"Proxy users" is Hadoop-speak for impersonation. It allows user A to impersonate user B when connecting to a service, if that service allows it.

Spark allows impersonation when submitting applications, so that the whole application runs as user B in the above example.

Spark does not allow token renewal when impersonation is on. Impersonation was added to Spark as a means for services (like Hive or Oozie) to start Spark applications on behalf of users. That means that those services would provide the Spark launcher code with privileged credentials and, potentially, user code that will run when the application starts. The user code is not necessarily under the control of the service.

In that situation, the service credentials should never be made available to the Spark application, since that would be tantamount to giving your service credentials to unprivileged users.

The above also implies that running impersonated applications in client mode can be a security concern, since arbitrary user code would have access to the same local content as the privileged user. But unlike token renewal, Spark does not prevent that configuration from running.

When impersonating, the Spark launcher will create DTs for the "proxy" user. In the example used above, that means that when code authenticates to a service using the DTs, the authenticated user will be "B", not "A".

Note that "proxy user" is a very Hadoop-specific concept. It does not apply to OS users (which is why the client-mode case is an issue) or to services that do not authenticate using Hadoop's `UserGroupInformation` system.
It is generally used in the context of YARN, since an application submitted as a proxy user will run as that particular user in the YARN cluster, obeying any Hadoop-to-local-OS-user mapping configured for the service. But the overall support should work for connecting to other services even when YARN is not being used.

Also, if writing a new DT provider in Spark, be aware that providers need to explicitly handle impersonation. If a service does not support impersonation, the provider should either error out or not generate tokens, depending on what makes more sense in the context.

## Externally Generated DTs

Spark uses the `UserGroupInformation` API to manage the Hadoop credentials. That means that Spark inherits the feature of loading DTs automatically from a file. The Hadoop classes will load the token cache pointed at by the `HADOOP_TOKEN_FILE_LOCATION` environment variable, when it's defined.

In this situation, Spark will not create DTs for the services that already have tokens in the cache. It may try to get delegation tokens for other services if Kerberos credentials are also provided.

This feature is mostly used by services that start Spark on behalf of users. Regular users do not generally use this feature, given that it would require them to figure out how to get those tokens outside of Spark.

## Limitations of DT support in Spark

There are certain limitations to bear in mind when talking about DTs in Spark.

The first one is that not all DTs actually expose their renewal period. This is a service-side configuration that is generally not exposed to clients. For this reason, certain DT providers cannot provide a renewal period to the Spark code, thus requiring that the service's configuration be in some way synchronized with another one that does provide that information.
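As an illustrative sketch (not Spark's actual code; the names are invented), a renewal scheduler in this situation has to fall back to some other configured period whenever a provider cannot report one:

```python
# Hypothetical helper: pick the next renewal delay from whatever periods
# the DT providers were able to report, in seconds.
HDFS_STYLE_FALLBACK = 24 * 3600  # the 24-hour default used as an example earlier

def next_renewal_delay(reported_periods):
    """Use the shortest period any provider reported, so every token is
    re-created before the earliest expiry; if no provider reported one,
    fall back to the configured default."""
    periods = [p for p in reported_periods if p is not None]
    return min(periods) if periods else HDFS_STYLE_FALLBACK

print(next_renewal_delay([86400, None, 3600]))  # 3600: honor the earliest expiry
print(next_renewal_delay([None, None]))         # 86400: nothing reported, use fallback
```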
The HDFS service, which is generally available when DTs are needed in the first place, provides that information, so in general it's a good idea for all services using DTs to use the same configuration as HDFS for the renewal period.

The second one is that Spark doesn't always know what delegation tokens will be needed. For example, when submitting an application in cluster mode without a keytab, the launcher needs to create DTs without knowing what the application code will actually be doing. This means that Spark will try to get as many delegation tokens as possible based on the configuration available. So if an HBase configuration is available to the launcher but the app doesn't actually use HBase, a DT will still be generated. The user would have to explicitly opt out of generating HBase tokens in that case.

The third one is that it's hard to create DTs "as needed". Without being able to authenticate to specific services, Spark cannot create DTs, which means that applications submitted in cluster mode like the above need DTs to be created up front, instead of on demand.

The advantage, though, is that user code does not need to worry about DTs, since Spark will handle them transparently when the proper configuration is available.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org