[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-10-16 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-3204:
--
Fix Version/s: 1.1.0
   (was: 1.0.0)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.0
>            Reporter: Yann Byron
>            Assignee: Jonathan Vexler
>            Priority: Critical
>              Labels: hudi-on-call, pull-request-available, sev:critical
>             Fix For: 1.1.0
>
>   Original Estimate: 3h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> Currently, because Spark by default omits partition values from the data
> files (instead encoding them into partition paths for partitioned tables),
> using `TimestampBasedKeyGenerator` with the original timestamp-based column
> makes it impossible to retrieve the original value when reading from Spark,
> even though it is persisted in the data file as well.
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
>
> val df = Seq(
>   (1, "z3", 30, "v1", "2018-09-23"),
>   (2, "z3", 35, "v1", "2018-09-24")
> ).toDF("id", "name", "age", "ts", "data_date")
>
> // mor
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
>   option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_mor")
>
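> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+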
>
> // cannot query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
> // still cannot query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_cow")
>
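> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+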
>  |  20220110172721896|202201101727218

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-10-14 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-3204:
--
Fix Version/s: 1.0.0
   (was: 1.1.0)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-10-14 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-3204:
--
Fix Version/s: 1.1.0
   (was: 1.0.0)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-10-14 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-3204:
--
Sprint: Hudi 1.0 Blockers+Bugs Sprint  (was: Hudi 1.0 Sprint P1 Tasks)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-09-30 Thread Kate Huber (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kate Huber updated HUDI-3204:
-
Sprint: Hudi 1.0 Sprint P1 Tasks


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-09-30 Thread Y Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-3204:
--
Priority: Critical  (was: Blocker)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-08-16 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3204:

Fix Version/s: 1.0.0
   (was: 1.1.0)


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2024-05-20 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-3204:

Description: 
Currently, because Spark by default omits partition values from the data files
(instead encoding them into partition paths for partitioned tables), using
`TimestampBasedKeyGenerator` with the original timestamp-based column makes it
impossible to retrieve the original value when reading from Spark, even though
it is persisted in the data file as well.
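
The format mismatch behind this report can be illustrated without Hudi. The following is a minimal sketch, assuming an input date format of `yyyy-MM-dd` and an output date format of `yyyy/MM/dd` (inferred from the sample values shown in this report). `PartitionPathSketch` and `toPartitionPath` are hypothetical names, the code uses only `java.time`, and it is not Hudi's `TimestampBasedKeyGenerator`. The timezone option is ignored because the sample values are date-only.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Illustration only: a stand-in for the kind of reformatting that the
// input/output dateformat options describe. This is NOT Hudi's implementation.
object PartitionPathSketch {
  // Assumed formats, inferred from the sample rows (2018-09-24 -> 2018/09/24).
  private val inputFormat  = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  private val outputFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Hypothetical helper: derive a partition path from the original column value.
  def toPartitionPath(original: String): String =
    LocalDate.parse(original, inputFormat).format(outputFormat)

  def main(args: Array[String]): Unit = {
    val original      = "2018-09-24"
    val partitionPath = toPartitionPath(original)

    println(s"original column value: $original")
    println(s"derived partition path: $partitionPath")
    // The two strings differ, so a value read back from the path
    // is not the value that was originally written to the column.
    println(s"same string: ${original == partitionPath}")
  }
}
```

Under these assumptions the derived path is `2018/09/24` while the column held `2018-09-24`, so a reader that takes `data_date` from the partition path rather than from the data file would not see the original value.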

 
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", 
"2018-09-24")).toDF("id", "name", "age", "ts", "data_date")

// mor
df.write.format("hudi").
option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
option("hoodie.datasource.write.recordkey.field", "id").
option("hoodie.datasource.write.partitionpath.field", "data_date").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.datasource.write.keygenerator.class", 
"org.apache.hudi.keygen.TimestampBasedKeyGenerator").
option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
mode(org.apache.spark.sql.SaveMode.Append).
save("file:///tmp/hudi/issue_4417_mor")

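+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
|  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+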

// returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
// still returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// cow
df.write.format("hudi").
option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
option("hoodie.datasource.write.recordkey.field", "id").
option("hoodie.datasource.write.partitionpath.field", "data_date").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.datasource.write.keygenerator.class", 
"org.apache.hudi.keygen.TimestampBasedKeyGenerator").
option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
mode(org.apache.spark.sql.SaveMode.Append).
save("file:///tmp/hudi/issue_4417_cow") 

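+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
|  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
|  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
+-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
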
// returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show

// but 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
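
One way to read the COPY_ON_WRITE result above is with a toy model in plain Scala. It is not Hudi or Spark code: `StoredRow`, `readFromPath`, and `readFromFile` are hypothetical names, and the model only assumes that a reader fills `data_date` either from the partition path or from the data file. It does not attempt to explain the MERGE_ON_READ case, where neither filter returned rows.

```scala
// Toy model (not Hudi or Spark code): each row keeps the value stored in the
// data file plus the partition path derived for it by the key generator.
final case class StoredRow(id: Int, dataDateInFile: String, partitionPath: String)

val rows = Seq(
  StoredRow(1, "2018-09-23", "2018/09/23"),
  StoredRow(2, "2018-09-24", "2018/09/24")
)

// Reader that fills the partition column from the partition path.
def readFromPath(rows: Seq[StoredRow], literal: String): Seq[Int] =
  rows.filter(_.partitionPath == literal).map(_.id)

// Reader that fills the partition column from the value in the data file.
def readFromFile(rows: Seq[StoredRow], literal: String): Seq[Int] =
  rows.filter(_.dataDateInFile == literal).map(_.id)

println(readFromPath(rows, "2018-09-24")) // List()
println(readFromPath(rows, "2018/09/24")) // List(2)
println(readFromFile(rows, "2018-09-24")) // List(2)
```

Under this model, the request in the issue title amounts to making the third call possible: filtering on the value as it was originally written.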
 

 


[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2023-11-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3204:
-
Fix Version/s: 0.14.1

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
> Fix For: 0.14.1
>
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2023-04-20 Thread lichangfu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichangfu updated HUDI-3204:

Status: In Progress  (was: Reopened)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2023-04-20 Thread lichangfu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichangfu updated HUDI-3204:

Status: Open  (was: In Progress)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2023-01-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-3204:
-
Sprint:   (was: )

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Sprint: 2023-01-09

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Sprint:   (was: 0.13.0 Final Sprint)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Fix Version/s: (was: 0.13.0)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3204:
-
Epic Link: HUDI-5425

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
> Fix For: 0.13.0
>
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-12-19 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3204:
--
Sprint: 2022/12/26

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-10-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Fix Version/s: (was: 0.12.1)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-10-14 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Affects Version/s: 0.12.0

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.12.0
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
> Fix For: 0.12.1
>
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-09-29 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Description: 
{color:#172b4d}Currently, because Spark by default omits partition values from 
the data files (instead encoding them into the partition paths of partitioned 
tables), using `TimestampBasedKeyGenerator` with the original timestamp-based 
column makes it impossible to retrieve the original value when reading from 
Spark, even though it is persisted in the data file as well.{color}

 
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).
  toDF("id", "name", "age", "ts", "data_date")

// MOR table
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_mor")

// Contents of the MOR table:
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
// |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
// |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
// |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// Returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'").show
// Still returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// COW table
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_cow")

// Contents of the COW table:
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
// |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
// |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
// |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
// +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+

// Returns no rows
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show

// But filtering on 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
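To make the mismatch concrete, here is a minimal sketch in plain Java (java.time only, not Hudi's own code) of the kind of reformatting the key generator options above ask for. It assumes the input and output patterns are yyyy-MM-dd and yyyy/MM/dd, which is what the sample values and partition paths suggest. The class name PartitionPathSketch and the helper toPartitionPath are hypothetical, and the sketch ignores the configured timezone.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {

    // Hypothetical helper (not part of Hudi): parse the column value with the
    // input pattern and render it with the output pattern. Timezone handling
    // is deliberately left out of this sketch.
    static String toPartitionPath(String value, String inputPattern, String outputPattern) {
        LocalDate date = LocalDate.parse(value, DateTimeFormatter.ofPattern(inputPattern));
        return date.format(DateTimeFormatter.ofPattern(outputPattern));
    }

    public static void main(String[] args) {
        String original = "2018-09-24";
        String path = toPartitionPath(original, "yyyy-MM-dd", "yyyy/MM/dd");

        // prints: 2018-09-24 -> 2018/09/24
        System.out.println(original + " -> " + path);

        // prints: false (the path-derived string differs from the original value)
        System.out.println(path.equals(original));
    }
}
```

Because the two strings differ, a reader that takes data_date from the partition path instead of the data file would not match a filter written against the original value, which is consistent with the query results shown above.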
 

 

  was:
 
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysVa

[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen

2022-09-29 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3204:
--
Summary: Allow original partition column value to be retrieved when using 
TimestampBasedKeyGen  (was: spark on TimestampBasedKeyGenerator has no result 
when query by partition column)

> Allow original partition column value to be retrieved when using 
> TimestampBasedKeyGen
> -
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-on-call, pull-request-available, sev:critical
> Fix For: 0.12.1
>
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", 
> "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
> // mor
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
> option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", 
> "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
> "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", 
> "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_mor")
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show 
> // cow
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
> option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", 
> "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", 
> "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", 
> "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_cow") 
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +---+--