h1. Summary
Enable KinesisReceiver to use STSAssumeRoleSessionCredentialsProvider when setting up the Kinesis Client Library, in order to support secure cross-account Kinesis stream reads managed by the AWS Security Token Service (STS).
h1. Details
Spark's KinesisReceiver implementation utilizes the Kinesis Client Library (KCL) in order to allow users to write Spark Streaming jobs that operate on Kinesis data. The KCL uses a few AWS services under the hood in order to provide checkpointed, load-balanced processing of the underlying data in a Kinesis stream. Running the KCL requires permissions to be set up for the following AWS resources:
* AWS Kinesis for reading stream data
* AWS DynamoDB for storing KCL shared state in tables
* AWS CloudWatch for logging KCL metrics
The KinesisUtils.createStream() API allows users to authenticate to these services either by specifying an explicit AWS access key/secret key credential pair or by using the default credential provider chain. The three AWS services are thus authorized using either an AWS keypair (provided explicitly, parsed from environment variables, etc.):
!https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/KeypairOnly.png!
Or the IAM instance profile (when running on EC2):
!https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/InstanceProfileOnly.png!
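For reference, an explicit-keypair call through the existing Java API looks roughly like the sketch below. This is illustrative only: the stream name, application name, endpoint and keypair placeholders are made up, and the argument list follows the current keypair overload of KinesisUtils.createStream().

```java
// Sketch (untested): creating a Kinesis DStream with an explicit keypair via
// the existing KinesisUtils Java API. All names/endpoints are hypothetical.
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kinesis.KinesisUtils;

public class KeypairStreamExample {
    public static JavaDStream<byte[]> create(JavaStreamingContext jssc) {
        return KinesisUtils.createStream(
            jssc,
            "myKinesisApp",                            // KCL application (DynamoDB table) name
            "myStream",                                // Kinesis stream to read
            "https://kinesis.us-east-1.amazonaws.com", // Kinesis endpoint URL
            "us-east-1",                               // region
            InitialPositionInStream.LATEST,
            new Duration(10000),                       // checkpoint interval
            StorageLevel.MEMORY_AND_DISK_2(),
            "<awsAccessKeyId>",                        // explicit long-lived keypair
            "<awsSecretKey>");
    }
}
```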
AWS users often need to access resources across separate accounts, e.g. to consume data produced by another organization or by a service running in a different account for resource-isolation purposes. The AWS Security Token Service (STS) provides a secure way to authorize cross-account resource access by using temporary sessions to assume an IAM role in the AWS account that owns the resources being accessed.
The [IAM documentation|http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html] covers the specifics of how cross-account IAM role assumption works in much greater detail, but if an actor in account A wanted to read from a Kinesis stream in account B, the general steps required would look something like this:
* An IAM role is added to account B with read permissions for the Kinesis stream
** Trust policy is configured to allow account A to assume the role
* Actor in account A uses its own long-lived credentials to tell STS to assume the role in account B
* STS returns temporary credentials with permission to read from the stream in account B
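The steps above can be sketched directly against STS using the AWS SDK for Java. This is a hedged illustration, not part of the proposed change: the role ARN and session name are placeholders for account B's role.

```java
// Sketch (untested): assuming a role in account B via STS. The client itself
// authenticates with account A's long-lived credentials (default chain).
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class AssumeRoleExample {
    public static Credentials assumeRole() {
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();
        AssumeRoleRequest request = new AssumeRoleRequest()
            .withRoleArn("arn:aws:iam::<account-B-id>:role/KinesisReadRole")
            .withRoleSessionName("spark-kinesis-session");
        // STS returns temporary credentials scoped to the role in account B.
        return sts.assumeRole(request).getCredentials();
    }
}
```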
Applied to KinesisReceiver and the KCL, we could use a keypair as our long-lived credentials to authenticate to STS and assume an external role with the necessary KCL permissions:
!https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSKeypair.png!
Or the instance profile as long-lived credentials:
!https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSInstanceProfile.png!
The STSAssumeRoleSessionCredentialsProvider implementation of the AWSCredentialsProvider interface from the AWS SDK abstracts all of the management of the temporary session credentials away from the user. It simply needs the ARN of the AWS role to be assumed, a session name for STS labeling purposes, an optional session external ID and long-lived credentials to use for authenticating with the STS service itself.
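Constructing the provider from those inputs looks roughly like this (a hedged sketch; the role ARN, session name and external ID are hypothetical values):

```java
// Sketch (untested): building STSAssumeRoleSessionCredentialsProvider from a
// role ARN, a session name, an optional external ID and long-lived credentials.
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;

public class ProviderExample {
    public static AWSCredentialsProvider build() {
        return new STSAssumeRoleSessionCredentialsProvider.Builder(
                "arn:aws:iam::<account-B-id>:role/KinesisReadRole", // role to assume
                "spark-kinesis-session")                            // STS session name
            .withExternalId("example-external-id")                  // optional
            .withLongLivedCredentialsProvider(new DefaultAWSCredentialsProviderChain())
            .build();
    }
}
```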
Supporting cross-account Kinesis access via STS requires supplying the following additional configuration parameters:
* ARN of IAM role to assume in external account
* A name to apply to the STS session
* (optional) An IAM external ID to validate the assumed role against
STSAssumeRoleSessionCredentialsProvider takes these parameters as input and handles the lifecycle of the temporary session credentials. Ideally, users could simply supply an AWSCredentialsProvider instance as an argument when creating the stream that would be distributed to the executors for use when setting up the KCL. Since these classes aren't serializable, this will require an approach similar to SerializableAWSCredentials, where the config parameters are passed via a serializable object and the correct AWSCredentialsProvider implementation is created on the executor from those params.
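A minimal sketch of that serializable-holder approach, assuming hypothetical names (the class below is not proposed API, just an illustration of the pattern):

```java
// Sketch: ship only the serializable STS parameters to the executors,
// mirroring the SerializableAWSCredentials approach; the non-serializable
// provider is rebuilt executor-side from these fields.
import java.io.Serializable;

public class SerializableSTSConfig implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String stsRoleArn;
    public final String stsSessionName;
    public final String stsExternalId; // may be null when no external ID is used

    public SerializableSTSConfig(String stsRoleArn, String stsSessionName,
                                 String stsExternalId) {
        this.stsRoleArn = stsRoleArn;
        this.stsSessionName = stsSessionName;
        this.stsExternalId = stsExternalId;
    }

    // Executor-side construction (commented out: requires the AWS SDK):
    // AWSCredentialsProvider provider(AWSCredentialsProvider longLived) {
    //     STSAssumeRoleSessionCredentialsProvider.Builder b =
    //         new STSAssumeRoleSessionCredentialsProvider.Builder(stsRoleArn, stsSessionName)
    //             .withLongLivedCredentialsProvider(longLived);
    //     return (stsExternalId == null ? b : b.withExternalId(stsExternalId)).build();
    // }
}
```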
Following the current conventions, adding optional arguments would mean doubling the number of overloaded implementations of KinesisUtils.createStream(). For this reason, we can make stsRoleArn, stsSessionName and stsExternalId each required parameters for the STS variants (the external ID is ignored if none is specified in the trust policy of the assumed role).
Here are the providers that should be used for authentication depending on the combination of AWS parameters provided:
||Input params||Kinesis credentials||Long-lived credentials||
|(none)|Use long-lived|DefaultAWSCredentialsProviderChain|
|awsAccessKeyId, awsSecretKey|Use long-lived|AWSCredentialsProvider w/keypair|
|stsRoleArn, stsSessionName, stsExternalId (optional)|STSAssumeRoleSessionCredentialsProvider|DefaultAWSCredentialsProviderChain|
|awsAccessKeyId, awsSecretKey, stsRoleArn, stsSessionName, stsExternalId (optional)|STSAssumeRoleSessionCredentialsProvider|AWSCredentialsProvider w/keypair|
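The selection rules in the table can be modeled with a small sketch (provider classes are represented by name only; a real implementation would construct actual AWSCredentialsProvider instances):

```java
// Sketch modeling the provider-selection rules: the long-lived provider is a
// keypair provider if a keypair was given, else the default chain; Kinesis
// itself is read via STS only when the STS params are present.
public class ProviderSelection {
    /** Returns {kinesisProvider, longLivedProvider} for the given params. */
    public static String[] select(boolean hasKeypair, boolean hasStsParams) {
        String longLived = hasKeypair
            ? "AWSCredentialsProvider w/keypair"
            : "DefaultAWSCredentialsProviderChain";
        String kinesis = hasStsParams
            ? "STSAssumeRoleSessionCredentialsProvider"
            : longLived; // no STS params: read Kinesis with the long-lived creds
        return new String[] { kinesis, longLived };
    }
}
```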
Since there's now a wide variety of combinations of optional parameters for KinesisUtils.createStream(), I think a builder pattern may provide a more manageable interface for creating streams in both Scala and Java. This would also make it feasible to specify separate AWS config params for DynamoDB and CloudWatch, which the KCL supports. I may look into submitting an issue/PR for this as well.
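A rough shape of such a builder might look like this; the class and method names here are assumptions for illustration, not a proposed API:

```java
// Hypothetical builder sketch: optional parameters are set fluently instead
// of through a growing family of createStream() overloads.
public class KinesisStreamBuilder {
    private final String streamName;
    private final String endpointUrl;
    private String awsAccessKeyId;
    private String awsSecretKey;
    private String stsRoleArn;
    private String stsSessionName;
    private String stsExternalId;

    public KinesisStreamBuilder(String streamName, String endpointUrl) {
        this.streamName = streamName;
        this.endpointUrl = endpointUrl;
    }

    public KinesisStreamBuilder withKeypair(String accessKeyId, String secretKey) {
        this.awsAccessKeyId = accessKeyId;
        this.awsSecretKey = secretKey;
        return this;
    }

    public KinesisStreamBuilder withSTSRole(String roleArn, String sessionName) {
        this.stsRoleArn = roleArn;
        this.stsSessionName = sessionName;
        return this;
    }

    public KinesisStreamBuilder withSTSExternalId(String externalId) {
        this.stsExternalId = externalId;
        return this;
    }

    // A real build() would return the DStream; this stand-in just reports
    // which credentials path was configured.
    public String describe() {
        return streamName + " via " + (stsRoleArn == null ? "long-lived creds" : stsRoleArn);
    }
}
```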