[ https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001084#comment-17001084 ]
Yanjia Gary Li commented on HUDI-415:
-------------------------------------

PR merged. Issue resolved.

> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-415
>                 URL: https://issues.apache.org/jira/browse/HUDI-415
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>            Reporter: Yanjia Gary Li
>            Assignee: Yanjia Gary Li
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hudi records the commit time only after the first Spark action completes. If
> there is a heavy transformation before isEmpty(), the recorded commit time can
> be much later than the time the job actually started.
> {code:java}
> if (hoodieRecords.isEmpty()) {
>   log.info("new batch has no new records, skipping...")
>   return (true, common.util.Option.empty())
> }
> commitTime = client.startCommit()
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords,
>   commitTime, operation)
> {code}
> For example, if I start the Spark job at 201901010000 but *isEmpty()* runs for
> 2 hours, the commit time in the .hoodie folder will be 201901010*2*00. If I
> then use that commit time to ingest data starting from 201901010200 (from
> HDFS, not using deltastreamer), I will miss 2 hours of data.
> Is this setup intended? Can we obtain the commit time before calling
> isEmpty()?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
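
The timing gap described in the issue can be illustrated with a small, self-contained simulation. This is only a sketch and does not use any Hudi API: `CommitTimeSketch`, `stampAfterCheck`, and `stampBeforeCheck` are hypothetical names, the clock is a fake counter measured in hours since job start, and the slow `isEmpty()` is modelled as a supplier that advances that clock by 2. It is not the change made in the merged PR, only a picture of why the order of the two steps matters.

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

public class CommitTimeSketch {

    // Order reported in the issue: run the (possibly slow) emptiness check
    // first, then read the clock to obtain the commit time.
    public static OptionalLong stampAfterCheck(LongSupplier clock, BooleanSupplier isEmpty) {
        if (isEmpty.getAsBoolean()) {
            return OptionalLong.empty(); // no records, no commit
        }
        return OptionalLong.of(clock.getAsLong());
    }

    // Order asked for in the issue: read the clock first, so the commit
    // time reflects when the job started rather than when the check finished.
    public static OptionalLong stampBeforeCheck(LongSupplier clock, BooleanSupplier isEmpty) {
        long commitTime = clock.getAsLong();
        if (isEmpty.getAsBoolean()) {
            return OptionalLong.empty(); // no records, the timestamp is discarded
        }
        return OptionalLong.of(commitTime);
    }

    public static void main(String[] args) {
        // Fake clock: hours elapsed since the job started (assumption for illustration).
        AtomicLong fakeClock = new AtomicLong(0);
        // Stand-in for an isEmpty() that triggers 2 hours of upstream transformations.
        BooleanSupplier slowIsEmpty = () -> {
            fakeClock.addAndGet(2);
            return false;
        };

        long late = stampAfterCheck(fakeClock::get, slowIsEmpty).getAsLong();
        fakeClock.set(0); // restart the simulated job
        long early = stampBeforeCheck(fakeClock::get, slowIsEmpty).getAsLong();

        System.out.println("commit time taken after isEmpty():  +" + late + "h");
        System.out.println("commit time taken before isEmpty(): +" + early + "h");
    }
}
```

With the fake clock, the first variant stamps the commit 2 hours after the job start and the second stamps it at the start, which matches the 201901010000 versus 201901010200 example in the issue.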