stevenayers opened a new issue, #5455:
URL: https://github.com/apache/hudi/issues/5455

   Hi All,
   
   I'm currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).
   
   In Iceberg, you are able to do the following to query the Glue catalog:
   ```python
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options={
               "path": "my_catalog.my_glue_database.my_iceberg_table",
               "connectionName": "Iceberg Connector for Glue 3.0",
           },
           transformation_ctx="IcebergDyF",
       ).toDF()
   ```
   
   I'd like to do something similar with Hudi:
   ```python
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options= {
               "className": "org.apache.hudi",
               "hoodie.table.name": "my_hudi_table",
               "hoodie.consistency.check.enabled": "true",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.database": "my_glue_database",
               "hoodie.datasource.hive_sync.table":  "my_hudi_table",
               "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.partition_fields": partition_key
           },
        transformation_ctx="HudiDyF",
       )
   ```
   
   This would mean we wouldn't need to grab the S3 path of our data from boto3 every time, like so:
   ```python
   client = boto3.client('glue')
   response = client.get_table(
       DatabaseName='my_glue_database',
       Name='my_hudi_table'
   )  # <<----- don't want this
   targetPath = response['Table']['StorageDescriptor']['Location']  # <<----- or this
   df = glueContext.create_dynamic_frame.from_options(
           connection_type="marketplace.spark",
           connection_options= {
               "className": "org.apache.hudi",
            "path": targetPath,  # <<----- or this
               "hoodie.table.name": "my_hudi_table",
               "hoodie.consistency.check.enabled": "true",
               "hoodie.datasource.hive_sync.use_jdbc": "false",
               "hoodie.datasource.hive_sync.database": "my_glue_database",
               "hoodie.datasource.hive_sync.table":  "my_hudi_table",
               "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
               "hoodie.datasource.hive_sync.partition_fields": partition_key
           },
           transformation_ctx="HudiDyF",
       )
   # OR
   sourceTableDF = spark.read.format('hudi').load(targetPath)
   ```
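   In the meantime, the boto3 lookup above can at least be wrapped in a small reusable helper so each job doesn't repeat the catalog call inline (just a sketch; `resolve_table_path` and the injectable `client` parameter are my own names, not a Glue or Hudi API):
   ```python
   def resolve_table_path(database, table, client=None):
       """Return the S3 location of a table registered in the AWS Glue Data Catalog.

       `client` is injectable so the helper can be tested without AWS
       credentials; by default a real boto3 Glue client is created.
       """
       if client is None:
           import boto3  # deferred import: only needed for the real catalog call
           client = boto3.client("glue")
       response = client.get_table(DatabaseName=database, Name=table)
       return response["Table"]["StorageDescriptor"]["Location"]
   ```
   Then each job reduces to `targetPath = resolve_table_path('my_glue_database', 'my_hudi_table')` followed by `spark.read.format('hudi').load(targetPath)` — but native catalog-name-based resolution, as in the Iceberg connector, would still be nicer.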
   
   Is there any way to do this? I'm very new to Hudi, so if my configuration settings are wrong and this is actually possible, please let me know!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.