PhantomHunt opened a new issue, #8572: URL: https://github.com/apache/hudi/issues/8572
We have created a Hudi data lake with version 0.13.0 and need to read data from a few tables in an incremental fashion. To fetch the active timeline for a MoR table, we use the following piece of code, where `basePath` is the S3 bucket path where the data lies:

```python
# Build a HoodieTableMetaClient via py4j and collect the completed commit instants
metaClient = (
    spark._jvm.org.apache.hudi.common.table.HoodieTableMetaClient.builder()
    .setConf(spark._jsc.hadoopConfiguration())
    .setBasePath(basePath)
    .setLoadActiveTimelineOnLoad(True)
    .build()
)
timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()
instants = timeline.getInstants().collect(
    spark._jvm.java.util.stream.Collectors.toList()
).toArray()

list_timestamps = []  # initialization was missing in the original snippet
for instant in instants:
    list_timestamps.append(instant.getTimestamp())
```

The output looks like this:

```
["20230410110310171", "20230410111802858", "20230410135802426", "20230410233706724", "20230411070325158", "20230412075305123", "20230412104308890", "20230412112440414", "20230412123348380", "20230412123408573", "20230412123426951", "20230412143444989", "20230412143503391", "20230413104721504", "20230413120831774", "20230413122750909", "20230413153023354", "20230414045300420", "20230414105813727", "20230414110336441", "20230414111346898", "20230414142833034", "20230414145746900", "20230414145806366", "20230418070525211", "20230418095219696", "20230419055721930", "20230419065820905", "20230419100813940", "20230419111328181", "20230419160833254", "20230420055335537", "20230420055920454", "20230420061332289", "20230420080546225", "20230420080606133", "20230420081457928", "20230420090733990", "20230420092736555", "20230420093835449", "20230420135133130"]
```

When we tried to read the MoR table:

```python
mor = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '20230331130832572',
    'hoodie.datasource.read.end.instanttime': '20230410110310171',
}

try:
    df = spark.read.format("org.apache.hudi").options(**mor).load(path_to_table)
    return df
except Exception as e:
    log.msg(e, "e")
    return None
```

we got the following error:

```
23/04/20 14:25:33 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
java.io.FileNotFoundException: No such file or directory: s3a://****/2df00b9b-9fae-45a4-8492-e11ef16740b3-0_0-298-497_20230410110310171.parquet
```

The writer configuration is:

```python
writer_config = {
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'cdc_timestamp',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.keep.max.commits': 50,
    'hoodie.keep.min.commits': 40,
    'hoodie.cleaner.commits.retained': 30,
}
```

**Environment**
- Language: Python
- Hudi version: 0.13.0
- Job type: Python script on EC2
- Table type: non-partitioned MOR
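For context on how we intend to consume these timestamps: a minimal sketch of the window selection we plan to use for successive incremental pulls, driven by the `list_timestamps` collected above. `next_incremental_window` is a hypothetical helper of ours, not a Hudi API.

```python
def next_incremental_window(timestamps, last_consumed):
    """Return (begin, end) instant times for the next incremental read,
    where begin is the last instant already consumed and end is the newest
    completed instant; return None if nothing is newer than last_consumed.

    `timestamps` is the list of completed instant timestamps collected
    from the active timeline (zero-padded strings, so lexicographic
    ordering matches chronological ordering)."""
    newer = [ts for ts in sorted(timestamps) if ts > last_consumed]
    if not newer:
        return None
    return last_consumed, newer[-1]
```

The returned pair maps onto `hoodie.datasource.read.begin.instanttime` and `hoodie.datasource.read.end.instanttime` (begin is exclusive for Hudi incremental queries, which is why re-using the last consumed instant as `begin` avoids duplicates).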
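Note that the begin instant we pass, `20230331130832572`, predates the earliest instant still on the active timeline (`20230410110310171`). As a workaround we are considering clamping the begin instant to the active timeline before issuing the read; `clamp_begin_instant` below is a hypothetical helper of ours, not a Hudi API.

```python
def clamp_begin_instant(begin, active_timestamps):
    """If `begin` predates the earliest completed instant still on the
    active timeline, fall back to that earliest instant, so the incremental
    read does not reference commits whose data files the cleaner may
    already have deleted."""
    earliest = min(active_timestamps)
    return begin if begin >= earliest else earliest
```

This does not explain the `FileNotFoundException` itself, but it is the guard we would add if the root cause turns out to be reading across cleaned/archived commits.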