stevenayers opened a new issue, #5455: URL: https://github.com/apache/hudi/issues/5455
Hi All,

I'm currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).

In Iceberg, you are able to do the following to query the Glue catalog:

```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "path": "my_catalog.my_glue_database.my_iceberg_table",
        "connectionName": "Iceberg Connector for Glue 3.0",
    },
    transformation_ctx="IcebergDyF",
).toDF()
```

I'd like to do something similar with Hudi:

```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)
```

Meaning we wouldn't need to grab the S3 path of our data from boto3 every time, like so:

```python
client = boto3.client('glue')

response = client.get_table(
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
)  # <<----- don't want this

targetPath = response['Table']['StorageDescriptor']['Location']  # <<----- or this

df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "path": targetPath,  # <<----- or this
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
    },
    transformation_ctx="HudiDyF",
)

# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
```

Is there any way to do this? I'm very new to Hudi, so if my configuration settings are wrong and this is possible, please let me know!
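In the meantime, one way to avoid repeating the boto3 lookup in every job is to wrap it in a small helper. A minimal sketch below, assuming the same `get_table` response shape as in the snippet above; the function name is illustrative, not a Hudi or Glue API:

```python
def resolve_glue_table_path(database, table, client=None):
    """Return a table's S3 location from the AWS Glue catalog.

    The location lives at Table.StorageDescriptor.Location in the
    get_table response, matching the snippet above. A client can be
    injected for testing; otherwise a real Glue client is created.
    """
    if client is None:
        import boto3  # deferred so the helper is importable without boto3
        client = boto3.client("glue")
    response = client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Location"]


# Usage (hypothetical names):
# target_path = resolve_glue_table_path("my_glue_database", "my_hudi_table")
# df = spark.read.format("hudi").load(target_path)
```

This still goes through boto3 under the hood, so it is only a convenience wrapper, not the catalog-path syntax Iceberg offers.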