PhantomHunt opened a new issue, #8572:
URL: https://github.com/apache/hudi/issues/8572

   We have created a Hudi data lake with version 0.13.0.
   We need to read data from a few tables in an incremental fashion.
   To fetch the active timeline for a MoR table, we are using the following
piece of code, where basePath is the S3 bucket path where the data resides:
   ```
   # basePath is the S3 path where the table data resides
   metaClient = (
       spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient.builder()
       .setConf(spark._jsc.hadoopConfiguration())
       .setBasePath(basePath)
       .setLoadActiveTimelineOnLoad(True)
       .build()
   )

   timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()

   instants = timeline.getInstants().collect(
       spark._jvm.java.util.stream.Collectors.toList()
   ).toArray()

   list_timestamps = [instant.getTimestamp() for instant in instants]
   ```
   The output looks like this:
   `["20230410110310171", "20230410111802858", "20230410135802426", 
"20230410233706724", "20230411070325158", "20230412075305123", 
"20230412104308890", "20230412112440414", "20230412123348380", 
"20230412123408573", "20230412123426951", "20230412143444989", 
"20230412143503391", "20230413104721504", "20230413120831774", 
"20230413122750909", "20230413153023354", "20230414045300420", 
"20230414105813727", "20230414110336441", "20230414111346898", 
"20230414142833034", "20230414145746900", "20230414145806366", 
"20230418070525211", "20230418095219696", "20230419055721930", 
"20230419065820905", "20230419100813940", "20230419111328181", 
"20230419160833254", "20230420055335537", "20230420055920454", 
"20230420061332289", "20230420080546225", "20230420080606133", 
"20230420081457928", "20230420090733990", "20230420092736555", 
"20230420093835449", "20230420135133130"]`
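Since Hudi instant times are fixed-width `yyyyMMddHHmmssSSS` strings, lexicographic order equals chronological order. A small helper (the function name is our own, not a Hudi API) can check whether a requested begin instant is still covered by the fetched timeline before issuing an incremental read; instants older than the earliest retained commit may reference files already removed by the cleaner:

```python
def begin_instant_in_timeline(begin_ts, timestamps):
    """Return True if begin_ts is not older than the earliest instant
    still present in the active timeline.

    Hudi instant times are fixed-width yyyyMMddHHmmssSSS strings,
    so plain string comparison gives chronological order.
    """
    if not timestamps:
        return False
    earliest = min(timestamps)
    return begin_ts >= earliest

# Example against a subset of the timestamps fetched above:
timeline_ts = ["20230410110310171", "20230410111802858", "20230420135133130"]
print(begin_instant_in_timeline("20230331130832572", timeline_ts))  # False
print(begin_instant_in_timeline("20230412000000000", timeline_ts))  # True
```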
   
   When we tried to read the MoR table:
   ```
   mor = {
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': '20230331130832572',
       'hoodie.datasource.read.end.instanttime': '20230410110310171'
   }
   try:
       df = spark.read.format("org.apache.hudi").options(**mor).load(path_to_table)
       return df
   except Exception as e:
       log.msg(e, "e")
       return None
   ```
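For reference, the options dict above can be built from any pair of instants taken from the fetched timestamp list. A minimal sketch (the function name is ours; begin.instanttime is exclusive, end.instanttime inclusive for Hudi incremental queries):

```python
def incremental_read_options(begin_ts, end_ts):
    """Build Hudi incremental-query read options for spark.read.
    Records committed in the window (begin_ts, end_ts] are returned."""
    return {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': begin_ts,
        'hoodie.datasource.read.end.instanttime': end_ts,
    }

# Usage with two consecutive instants from the timeline:
opts = incremental_read_options("20230410110310171", "20230410111802858")
# df = spark.read.format("org.apache.hudi").options(**opts).load(path_to_table)
```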
                
   we got the following error:
   ```
   23/04/20 14:25:33 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
   java.io.FileNotFoundException: No such file or directory: s3a://****/2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet
   ```
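The missing file's name encodes which commit wrote it: Hudi base files are named `<fileId>_<writeToken>_<instantTime>.parquet`. A small parser (our own helper, not a Hudi API) recovers the instant time, which here equals the end instant used in the read above:

```python
def instant_time_from_base_file(filename):
    """Extract the commit instant time from a Hudi base-file name
    of the form <fileId>_<writeToken>_<instantTime>.parquet."""
    stem = filename.rsplit('.', 1)[0]   # drop the ".parquet" extension
    return stem.rsplit('_', 1)[-1]      # last underscore-separated token

name = "2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet"
print(instant_time_from_base_file(name))  # 20230410110310171
```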
   
   The writer configuration is:
   ```
   writer_config = {
       'fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.schema.on.read.enable': 'true',
       'hoodie.datasource.write.reconcile.schema': 'true',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.upsert.shuffle.parallelism': 200,
       'hoodie.keep.max.commits': 50,
       'hoodie.keep.min.commits': 40,
       'hoodie.cleaner.commits.retained': 30
   }
   ```
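Hudi expects `hoodie.keep.min.commits` to be greater than `hoodie.cleaner.commits.retained` (archival must not outrun the cleaner), and `hoodie.keep.max.commits` to exceed `hoodie.keep.min.commits`. A sanity check over the config above (the function is our own sketch, not a Hudi utility):

```python
def check_retention(config):
    """Return True if the archival/cleaner retention settings are consistent:
    keep.min.commits > cleaner.commits.retained and
    keep.max.commits > keep.min.commits."""
    retained = config['hoodie.cleaner.commits.retained']
    keep_min = config['hoodie.keep.min.commits']
    keep_max = config['hoodie.keep.max.commits']
    return keep_min > retained and keep_max > keep_min

writer_retention = {
    'hoodie.keep.max.commits': 50,
    'hoodie.keep.min.commits': 40,
    'hoodie.cleaner.commits.retained': 30,
}
print(check_retention(writer_retention))  # True: 40 > 30 and 50 > 40
```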
   Language - Python
   Hudi Version - 0.13.0
   Job Type - Python script on EC2
   Table Type - Non Partitioned MOR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
