demshar23 opened a new issue, #12471:
URL: https://github.com/apache/iceberg/issues/12471
### Query engine
PySpark in AWS Glue
### Question
I am trying to use the rewrite_table_path procedure in an AWS Glue version 5 PySpark job or
notebook, where I set the Spark config to assume a cross-account role (via the
AssumeRoleAwsClientFactory config) so the procedure runs as that role instead of the Glue job
execution role. When I run the procedure, it writes the rewritten metadata files to the S3
staging location using the assumed role, as verified by the S3 access logs. But when it
attempts to commit the final CSV file list (the file that enumerates the data and metadata
files to be copied), it uses the Glue execution role instead. I can also see via CloudTrail
that it uses the assumed role to make the glue:GetTable API call.
So the assumed role is used successfully for the Glue client, and only partially for the S3
client, when executing the procedure.
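For reference, the procedure call looks roughly like this (the table name, prefixes, and
staging location below are placeholders, not the real values):

```python
# Placeholder table and S3 paths; the real job points at the actual cross-account buckets
spark.sql(f"""
    CALL {catalog_name}.system.rewrite_table_path(
        table => 'db.my_table',
        source_prefix => 's3://source-bucket/warehouse/db.db/my_table',
        target_prefix => 's3://target-bucket/warehouse/db.db/my_table',
        staging_location => 's3://target-bucket/staging/my_table'
    )
""").show()
```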
Config:
I'm importing the iceberg-core-1.8.0, iceberg-spark-runtime-3.5_2.12-1.8.0, and
iceberg-aws-bundle-1.8.0 JARs into the job and using the following Spark session builder
config:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.arn", f"{role_arn}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.session-name", f"{session_name}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.assume-role.region", f"{aws_region}") \
    .getOrCreate()
```
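As a side note, the job's own default identity (the Glue execution role) can be checked
independently of the Iceberg client factory with a quick STS call, e.g.:

```python
import boto3

# The job's default credentials, i.e. the Glue job execution role; the Iceberg
# clients configured above should be using the assumed role instead
print(boto3.client("sts").get_caller_identity()["Arn"])
```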
If I don't import the JARs and simply query the Glue table in the session with the same
configuration, it does use the assumed role correctly for both the S3 and Glue APIs through
the AssumeRoleAwsClientFactory config. So the issue seems isolated to this procedure: other
procedures such as rewrite_data_files correctly use the assumed role to write the rewritten
data files to S3 instead of using the Glue execution role. The breakdown occurs only when the
procedure tries to place the final CSV file into the file-list subfolder within the staging
location.
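For comparison, these are roughly the calls that do pick up the assumed role correctly (table
name is a placeholder):

```python
# Plain reads go through the assumed role as expected (placeholder table name)
spark.sql(f"SELECT * FROM {catalog_name}.db.my_table LIMIT 10").show()

# rewrite_data_files also writes its rewritten data files with the assumed role
spark.sql(f"CALL {catalog_name}.system.rewrite_data_files(table => 'db.my_table')").show()
```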
Any ideas on an alternative configuration that might work, or could this be a conflict in the
procedure itself?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]