[GitHub] [hudi] bryanburke opened a new issue #3641: [SUPPORT] Retrieving latest completed commit timestamp via HoodieTableMetaClient in PySpark

GitBox Fri, 10 Sep 2021 11:27:56 -0700


bryanburke opened a new issue #3641:
URL: https://github.com/apache/hudi/issues/3641



   **Describe the problem you faced**
   
   I am not experiencing a problem. I would however like to request advice/peer 
review to ensure I am using the Hudi Java classes and methods in the most 
appropriate manner.
   
   Goal: Retrieve the timestamp of the latest completed commit in a Hudi table, 
loading only Hudi metadata files from S3 in the process.
   
   Sample code below in the **To Reproduce** is the approach I am using to 
accomplish this goal in a PySpark ETL script via HoodieTableMetaClient.
   
   The overall idea is to save the timestamp of the latest completed commit on 
the source Hudi table as a bookmark so the next ETL script run can process only 
the incremental changes after that point.
   
   General questions:
   
   - Is this approach valid? If not, what alternative do you suggest?
   - Do the Hudi classes and methods I use have relatively stable public 
interfaces that are not likely to change significantly over time?
   - As development progresses, are there any plans to expose parts of Hudi's 
API via Python?
   
   I appreciate your time and expertise! Thanks for creating and maintaining 
this incredible framework!
   
   **To Reproduce**
   
   Sample code:
   
   ```python
   # sc already exists within the PySpark session.
   source_path = "s3a://example-bucket/example-table/"
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
   client = (
       sc._jvm
       .org.apache.hudi.common.table.HoodieTableMetaClient
       .builder()
       .setConf(sc._jsc.hadoopConfiguration())
       .setBasePath(source_path)
       .build()
   )
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
   timeline = client.getCommitsTimeline().filterCompletedInstants()
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/util/Option.java
   # https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
   last_instant = timeline.lastInstant().orElse(None)
   if last_instant is not None:
       last_processed = last_instant.getTimestamp()
   ```
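
For context on how the saved bookmark would be consumed, here is a minimal sketch of the follow-up incremental read. It assumes the standard Hudi datasource read options `hoodie.datasource.query.type` and `hoodie.datasource.read.begin.instanttime`. The helper names `incremental_read_options` and `read_incremental` are hypothetical, as is the choice of `"000"` as a begin time meaning "all commits" when no bookmark exists yet.

```python
from typing import Dict, Optional

# Hudi datasource read option keys.
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
BEGIN_INSTANT_KEY = "hoodie.datasource.read.begin.instanttime"

# Assumption: "000" sorts before every real instant time, so using it as the
# begin time selects every commit on the first run.
EARLIEST_INSTANT = "000"


def incremental_read_options(bookmark: Optional[str]) -> Dict[str, str]:
    """Build read options that select commits made after the bookmark."""
    begin = bookmark if bookmark else EARLIEST_INSTANT
    return {QUERY_TYPE_KEY: "incremental", BEGIN_INSTANT_KEY: begin}


def read_incremental(spark, source_path: str, bookmark: Optional[str]):
    """Load only the records written after the bookmarked commit (hypothetical helper)."""
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(bookmark))
        .load(source_path)
    )
```

On the next run, the ETL script would call something like `read_incremental(spark, source_path, last_processed)` and then store the new latest completed commit timestamp as the updated bookmark.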
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.1
   
   * Hive version : 2.3.7
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3A
   
   * Running on Docker? (yes/no) : yes


