[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580711#comment-17580711 ]
Steve Loughran commented on SPARK-38954:
----------------------------------------

Any plans to put the PR up? I'm curious about what you've done. The Hadoop S3A delegation tokens can be used to collect credentials and encryption secrets at Spark launch and pass them to the workers, though there is no mechanism to update the tokens during the life of a session. You might want to look at that code and experiment with it. If you are doing your own provider, do update the credentials at least 30s before they expire, and add some synchronization so that 30 threads don't all try to do it independently (a sketch of that pattern is included below).

> Implement sharing of cloud credentials among driver and executors
> ------------------------------------------------------------------
>
> Key: SPARK-38954
> URL: https://issues.apache.org/jira/browse/SPARK-38954
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Parth Chandra
> Priority: Major
>
> Currently Spark uses external implementations (e.g. hadoop-aws) to access cloud services like S3. To access the actual service, these implementations use credential provider implementations that obtain credentials allowing access to the cloud service.
>
> These credentials are typically session credentials, which means that they expire after a fixed time. Sometimes this expiry can be as short as an hour, and for a Spark job that runs for many hours (or a Spark streaming job that runs continuously), the credentials have to be renewed periodically.
>
> In many organizations, the process of getting credentials may be multi-step. The organization has an identity provider service that authenticates the user, while the cloud service provider authorizes the roles the user has access to. Once the user is authenticated and her role verified, credentials are generated for a new session.
>
> In a large setup with hundreds of Spark jobs and thousands of executors, each executor then spends a lot of time obtaining credentials, and this may put unnecessary load on the backend authentication services.
>
> To alleviate this, we can use Spark's architecture to obtain the credentials once in the driver and push them to the executors. In addition, the driver can check the expiry of the credentials and push updated credentials to the executors. This is relatively easy to do since the RPC mechanism to implement it is already in place and is used similarly for Kerberos delegation tokens.
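Following up on the comment above about refreshing early and synchronizing: below is a minimal, illustrative sketch of a custom credential provider, assuming the v1 AWS SDK AWSCredentialsProvider interface that older hadoop-aws releases build against. fetchSessionCredentials() stands in for the organization's identity/authorization flow and is purely hypothetical; this is a sketch of the refresh-margin and locking pattern, not a drop-in implementation.

{code:scala}
import java.time.Instant

import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, BasicSessionCredentials}

/**
 * Sketch of a session-credential provider that refreshes early and
 * serialises refreshes, so concurrent task threads do not each hit the
 * backend identity service. fetchSessionCredentials() is a hypothetical
 * call to the organisation's identity/authorisation service.
 */
class RefreshingSessionCredentialsProvider extends AWSCredentialsProvider {

  // Refresh this long before the advertised expiry (per the comment above).
  private val refreshMarginSeconds = 30L

  @volatile private var cached: BasicSessionCredentials = _
  @volatile private var expiry: Instant = Instant.EPOCH

  override def getCredentials: AWSCredentials = {
    if (needsRefresh()) refresh()
    cached
  }

  override def refresh(): Unit = synchronized {
    // Re-check inside the lock so only one thread actually refreshes.
    if (needsRefresh()) {
      val (accessKey, secretKey, sessionToken, expiresAt) = fetchSessionCredentials()
      cached = new BasicSessionCredentials(accessKey, secretKey, sessionToken)
      expiry = expiresAt
    }
  }

  private def needsRefresh(): Boolean =
    cached == null || Instant.now().isAfter(expiry.minusSeconds(refreshMarginSeconds))

  /** Hypothetical backend call; returns keys, session token and expiry time. */
  private def fetchSessionCredentials(): (String, String, String, Instant) = ???
}
{code}

If wired into S3A, such a class would typically be named in fs.s3a.aws.credentials.provider; hadoop-aws generally expects the provider to expose either a no-argument constructor or one taking (URI, Configuration).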
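The quoted description proposes fetching credentials once in the driver and pushing refreshed credentials to the executors over the same kind of RPC path already used for Kerberos delegation tokens. The sketch below only illustrates the driver-side renewal loop under that assumption; SessionCredentials, fetchCredentials and pushToExecutors are hypothetical placeholders for whatever fetch and distribution mechanism the actual change would use.

{code:scala}
import java.time.Instant
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical container for whatever the driver would ship to executors.
case class SessionCredentials(accessKey: String,
                              secretKey: String,
                              sessionToken: String,
                              expiresAt: Instant)

/**
 * Illustrative driver-side renewal loop: fetch credentials once, push them
 * out through some broadcast/RPC hook, and re-fetch well before expiry.
 * Both function parameters are placeholders, not real Spark APIs.
 */
class DriverCredentialRenewer(fetchCredentials: () => SessionCredentials,
                              pushToExecutors: SessionCredentials => Unit) {

  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = renewAndReschedule()

  private def renewAndReschedule(): Unit = {
    val creds = fetchCredentials()
    pushToExecutors(creds)
    // Schedule the next renewal at ~75% of the remaining lifetime so the
    // push reaches executors well before the old credentials expire.
    val remainingMs = creds.expiresAt.toEpochMilli - System.currentTimeMillis()
    val delayMs = math.max(remainingMs * 3 / 4, 1000L)
    scheduler.schedule(new Runnable {
      override def run(): Unit = renewAndReschedule()
    }, delayMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdownNow()
}
{code}

Renewing at roughly 75% of the remaining lifetime mirrors the margin commonly used for delegation token renewal, leaving time for the updated credentials to reach every executor before the old ones expire.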