Yanjia Gary Li created HUDI-415:
-----------------------------------

             Summary: HoodieSparkSqlWriter Commit time not representing the Spark job starting time
                 Key: HUDI-415
                 URL: https://issues.apache.org/jira/browse/HUDI-415
             Project: Apache Hudi (incubating)
          Issue Type: Bug
            Reporter: Yanjia Gary Li
            Assignee: Yanjia Gary Li

Hudi records the commit time only after the first Spark action completes. Spark evaluates transformations lazily, so if there is a heavy transformation before isEmpty(), all of that work runs inside isEmpty(), and the commit time generated afterwards can be much later than the time the job started.

{code:java}
// isEmpty() is the first action, so it triggers all upstream transformations
if (hoodieRecords.isEmpty()) {
  log.info("new batch has no new records, skipping...")
  return (true, common.util.Option.empty())
}
// the commit time is only generated here, after isEmpty() has finished
commitTime = client.startCommit()
writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, commitTime, operation)
{code}

For example, if I start the Spark job at 201901010000 and isEmpty() runs for 2 hours, the commit time in the .hoodie folder will be 201901010200. If I then use that commit time as the starting point for the next ingestion, reading from 201901010200 onwards (from HDFS, not using deltastreamer), I will miss 2 hours of data.

Is this behavior intended? Can we obtain the commit time before calling isEmpty()?
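
To make the ordering problem concrete, here is a minimal, self-contained sketch that reproduces the numbers from the example above with a simulated clock. Everything in it (CommitTimeSketch, FakeClock, commitTimeAfterCheck, commitTimeBeforeCheck) is a hypothetical name made up for illustration and is not part of the Hudi API. The emptiness check is modelled as a callback that advances the clock by two hours, standing in for the transformations that isEmpty() triggers.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names are Hudi APIs.
class CommitTimeSketch {
    // Same pattern as the timestamps quoted in this issue (yyyyMMddHHmm).
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // Simulated clock so the example is deterministic.
    static final class FakeClock {
        private LocalDateTime now;

        FakeClock(LocalDateTime start) { this.now = start; }

        void advanceHours(long hours) { now = now.plusHours(hours); }

        String stamp() { return now.format(FORMAT); }
    }

    // Current order: run the emptiness check first, then take the commit time.
    static String commitTimeAfterCheck(FakeClock clock, BooleanSupplier isEmpty) {
        if (isEmpty.getAsBoolean()) {
            return null; // empty batch, no commit
        }
        return clock.stamp();
    }

    // Proposed order: take the commit time first, then run the emptiness check.
    static String commitTimeBeforeCheck(FakeClock clock, BooleanSupplier isEmpty) {
        String commitTime = clock.stamp();
        if (isEmpty.getAsBoolean()) {
            return null; // empty batch, the captured time is discarded
        }
        return commitTime;
    }

    public static void main(String[] args) {
        LocalDateTime jobStart = LocalDateTime.of(2019, 1, 1, 0, 0);

        // The lambda stands in for isEmpty() running two hours of upstream work.
        FakeClock c1 = new FakeClock(jobStart);
        System.out.println(commitTimeAfterCheck(c1, () -> { c1.advanceHours(2); return false; }));
        // prints 201901010200

        FakeClock c2 = new FakeClock(jobStart);
        System.out.println(commitTimeBeforeCheck(c2, () -> { c2.advanceHours(2); return false; }));
        // prints 201901010000
    }
}
```

The sketch only models when the timestamp is captured. If starting a commit in Hudi has side effects beyond generating a timestamp (an assumption to verify against the client code), the empty-batch path would also need to clean those up after the call is moved.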