[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Y Ethan Guo updated HUDI-3204:
------------------------------
    Fix Version/s: 1.1.0
                       (was: 1.0.0)

> Allow original partition column value to be retrieved when using TimestampBasedKeyGen
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.0
>            Reporter: Yann Byron
>            Assignee: Jonathan Vexler
>            Priority: Critical
>              Labels: hudi-on-call, pull-request-available, sev:critical
>             Fix For: 1.1.0
>
>   Original Estimate: 3h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> Currently, because Spark by default omits partition values from the data files
> (encoding them into partition paths instead, for partitioned tables), using
> `TimestampBasedKeyGenerator` with a timestamp-based partition column makes it
> impossible to retrieve the original column value when reading from Spark, even
> though that value is also persisted in the data file.
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
>
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24"))
>   .toDF("id", "name", "age", "ts", "data_date")
>
> // mor
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
>   option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_mor")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_cow")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
> // but 2018/09/24 works
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
> {code}
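The failure mode in the report can be reproduced outside Spark and Hudi. `TimestampBasedKeyGenerator` parses the partition column with the input date format and re-renders it with the output format, so the value encoded in the partition path ("2018/09/24") differs from the value the writer was given ("2018-09-24"); a reader that reconstitutes data_date from the partition path (Spark's default behaviour for partitioned tables) can then never match a predicate on the original value. A minimal sketch of that interaction in plain Python; the helper names are hypothetical, and this models only the copy-on-write behaviour (the merge-on-read table in the report matched neither literal, which is a further wrinkle):

```python
from datetime import datetime

INPUT_FMT = "%Y-%m-%d"    # corresponds to input.dateformat  "yyyy-MM-dd"
OUTPUT_FMT = "%Y/%m/%d"   # corresponds to output.dateformat "yyyy/MM/dd"

def partition_path(data_date: str) -> str:
    """Mimic the DATE_STRING branch of a timestamp-based key generator:
    parse with the input format, render with the output format."""
    return datetime.strptime(data_date, INPUT_FMT).strftime(OUTPUT_FMT)

def read_with_path_extraction(rows, predicate_value):
    """A reader that takes data_date from the partition path instead of
    from the data file, then applies the equality predicate."""
    return [r for r in rows if partition_path(r["data_date"]) == predicate_value]

rows = [{"id": 1, "data_date": "2018-09-23"},
        {"id": 2, "data_date": "2018-09-24"}]

# Once the path is the source of truth, the original value matches nothing:
assert read_with_path_extraction(rows, "2018-09-24") == []
# Only the output-format rendering does:
assert [r["id"] for r in read_with_path_extraction(rows, "2018/09/24")] == [2]
```

The data file itself still stores "2018-09-24", which is exactly why the ticket asks for the original column value to be retrievable on read.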
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Y Ethan Guo updated HUDI-3204:
------------------------------
    Fix Version/s: 1.0.0
                       (was: 1.1.0)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Y Ethan Guo updated HUDI-3204:
------------------------------
    Fix Version/s: 1.1.0
                       (was: 1.0.0)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Y Ethan Guo updated HUDI-3204:
------------------------------
    Sprint: Hudi 1.0 Blockers+Bugs Sprint  (was: Hudi 1.0 Sprint P1 Tasks)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kate Huber updated HUDI-3204:
-----------------------------
    Sprint: Hudi 1.0 Sprint P1 Tasks
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Y Ethan Guo updated HUDI-3204:
------------------------------
    Priority: Critical  (was: Blocker)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3204:
----------------------------
    Fix Version/s: 1.0.0
                       (was: 1.1.0)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geser Dugarov updated HUDI-3204:

    Description:

{color:#172b4d}Currently, because Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` with a timestamp-based partition column makes it impossible to retrieve the original column value when reading from Spark, even though it is persisted in the data file as well.{color}

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")

// mor
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_mor")

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
|  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
// still can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// cow
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_cow")

+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
|  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
// but 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
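The failing predicates above come down to an asymmetry between the input and output date formats: the key generator parses `data_date` with the input format and renders the partition path with the output format, so the path value no longer equals the column value. A minimal standalone sketch of that transformation (illustrative only, not Hudi's actual `TimestampBasedKeyGenerator` code; the class name and the two format patterns here are assumptions mirroring the write options above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {
    // Assumed formats, mirroring the write options:
    //   input.dateformat  = yyyy-MM-dd  (format of the data_date column)
    //   output.dateformat = yyyy/MM/dd  (format used in the partition path)
    static final DateTimeFormatter INPUT  = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    static final DateTimeFormatter OUTPUT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Parse the column value with the input format, re-render it with the
    // output format -- this is the value that ends up in the partition path.
    static String partitionPath(String dataDate) {
        return LocalDate.parse(dataDate, INPUT).format(OUTPUT);
    }

    public static void main(String[] args) {
        // "2018-09-24" becomes partition path "2018/09/24", so a filter on
        // data_date = '2018-09-24' no longer matches the path-derived value.
        System.out.println(partitionPath("2018-09-24"));
    }
}
```

This is why the COW table only matches `data_date = '2018/09/24'`: Spark fills the column from the partition path, which holds the re-rendered value rather than the original one.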
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3204:
    Fix Version/s: 0.14.1
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lichangfu updated HUDI-3204:
    Status: In Progress (was: Reopened)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lichangfu updated HUDI-3204:
    Status: Open (was: In Progress)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-3204:
    Sprint: (was: )
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3204:
    Sprint: 2023-01-09
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3204:
    Sprint: (was: 0.13.0 Final Sprint)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3204:
    Fix Version/s: (was: 0.13.0)
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3204:
-----------------------------
    Epic Link: HUDI-5425

> Allow original partition column value to be retrieved when using TimestampBasedKeyGen
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.0
>            Reporter: Yann Byron
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: hudi-on-call, pull-request-available, sev:critical
>             Fix For: 0.13.0
>
>   Original Estimate: 3h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> Currently, because Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` with the original timestamp-based column makes it impossible to retrieve the original value when reading from Spark, even though it is persisted in the data file as well.
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
>
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
>
> // mor
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
>   option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_mor")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
>
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_cow")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
>
> // but 2018/09/24 works
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
> {code}
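The mismatch in the reproduction comes down to the key generator reformatting the partition column: the input value `2018-09-24` (parsed with the configured input format `yyyy-MM-dd`) is written into the partition path as `2018/09/24` (the output format `yyyy/MM/dd`). A reader that fills the column from the partition path therefore sees the transformed value, and a predicate on the original string matches nothing. A minimal sketch of that transformation, using plain `java.time` rather than Hudi's actual `TimestampBasedKeyGenerator` internals (class and method names here are illustrative only):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TimestampPartitionSketch {
    // Mirrors the formats configured in the reproduction:
    //   hoodie.deltastreamer.keygen.timebased.input.dateformat  = yyyy-MM-dd
    //   hoodie.deltastreamer.keygen.timebased.output.dateformat = yyyy/MM/dd
    static final DateTimeFormatter INPUT  = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    static final DateTimeFormatter OUTPUT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Conceptually what a timestamp-based key generator does to the
    // partition column value before building the partition path.
    static String toPartitionPath(String original) {
        return LocalDate.parse(original, INPUT).format(OUTPUT);
    }

    public static void main(String[] args) {
        String original = "2018-09-24";
        String path = toPartitionPath(original);
        System.out.println(path);                  // 2018/09/24
        System.out.println(path.equals(original)); // false: the round trip is lossy
    }
}
```

Because the mapping is not applied to the query predicate, `where("data_date = '2018-09-24'")` compares the original string against the path-derived `2018/09/24` and returns no rows, which is exactly what the MOR and COW examples above show.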
Sagar Sumit updated HUDI-3204:
------------------------------
    Sprint: 2022/12/26
Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Fix Version/s: (was: 0.12.1)
Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Affects Version/s: 0.12.0
Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Description: (updated to add the explanation that Spark omits partition values from data files; the reproduction steps are unchanged)
Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Summary: Allow original partition column value to be retrieved when using TimestampBasedKeyGen  (was: spark on TimestampBasedKeyGenerator has no result when query by partition column)