Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2087971292

@liangchen-datanerd That's a good suggestion. I created a tracking JIRA too: https://issues.apache.org/jira/browse/HUDI-7698

We can think about introducing a reader-side config to enable this. The original data is already in the parquet files, so it should not be a challenge. Feel free to contribute if you are interested. Thanks.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
liangchen-datanerd commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2063009955

@ad1happy2go In the IoT scenario I've been working on, the event time is used as the partition column. At the same time, we query the data by the original event-time timestamp, not by the transformed partition path. Implementing this feature would be a great help. Should I close this issue or leave it open?
Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2059291917

@liangchen-datanerd Thanks, got it. So this is not implemented currently: at the moment the partition column is transformed after it is read from parquet. I will check whether we can prioritise this one, although it could cause regressions in other functionality, so we need to check more.
Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
liangchen-datanerd commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2054533892

@ad1happy2go Thanks for the reply. Hudi did transform the partition column's timestamp value to the date-format value based on the hoodie.keygen.timebased.output.dateformat: yyyy-MM-dd config. At the same time, the original timestamp value can't be retrieved through Spark even though it's persisted in the parquet file. As Hudi already has _hoodie_partition_path to indicate the partition path, why not keep the original data for the partition column? For example, when I query the Hudi table, I expect the time column to be a timestamp value. How can I retrieve the original timestamp value for the time column?

This is the Hudi table query result I mentioned:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
```
If I have not illustrated this issue well, the ticket [HUDI-3204](https://issues.apache.org/jira/browse/HUDI-3204) describes a similar issue. Thanks
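One possible workaround, not confirmed by the maintainers in this thread, is to partition on a copy of the event-time column so that the `time` column itself is never a partition column and should come back unchanged on read. The sketch below reuses the option keys from the reproduction in the original issue; the column name `time_partition` and both helper functions are hypothetical names chosen for illustration, and the approach has not been verified against any particular Hudi release.

```python
# Assumed name for the duplicated column that is used only for partitioning.
PARTITION_COPY = "time_partition"


def build_hudi_options(partition_col: str = PARTITION_COPY) -> dict:
    """Return write options that partition on the copy, not on `time`."""
    return {
        "hoodie.table.name": "employee_hudi",
        "hoodie.datasource.write.operation": "insert_overwrite_table",
        "hoodie.datasource.write.recordkey.field": "employee_name",
        # Only this entry differs from the options in the issue.
        "hoodie.datasource.write.partitionpath.field": f"{partition_col}:TIMESTAMP",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
        "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
        "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
        "hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd",
    }


def write_with_partition_copy(df, path: str) -> None:
    """Duplicate `time` into the partition-only column, then write to Hudi."""
    # Imported lazily so build_hudi_options() can be used without Spark installed.
    from pyspark.sql import functions as F

    (
        df.withColumn(PARTITION_COPY, F.col("time"))
        .write.format("hudi")
        .options(**build_hudi_options())
        .mode("overwrite")
        .save(path)
    )
```

In the pyspark shell this would be called as `write_with_partition_copy(df, "s3a://hudi-warehouse/test/")`. The trade-off is an extra column in the table, which may be acceptable until a reader-side option exists.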
Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2054100970

@liangchen-datanerd As you are using '"hoodie.keygen.timebased.output.dateformat":"yyyy-MM-dd"', it is expected for Hudi to output the partition column value in date format only, not as a timestamp. Why do you think this is a problem? Can you elaborate on the issue you see here?
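To make the transformation under discussion concrete, here is a minimal plain-Python sketch of what the input and output date formats do to a single value. It is an illustration only, not Hudi's implementation: the function name is hypothetical, and the `strptime`/`strftime` patterns are assumed equivalents of the Java-style patterns `yyyy-MM-dd HH:mm:ss` and `yyyy-MM-dd`.

```python
from datetime import datetime

# Assumed Python equivalents of the Java-style patterns used in the config.
INPUT_FORMAT = "%Y-%m-%d %H:%M:%S"   # yyyy-MM-dd HH:mm:ss
OUTPUT_FORMAT = "%Y-%m-%d"           # yyyy-MM-dd


def to_partition_value(event_time: str) -> str:
    """Parse the event time and re-format it as a date-only partition value."""
    return datetime.strptime(event_time, INPUT_FORMAT).strftime(OUTPUT_FORMAT)


if __name__ == "__main__":
    for ts in ["2023-01-02 12:12:23", "2023-01-01 01:15:23"]:
        print(ts, "->", to_partition_value(ts))
```

The time-of-day part is lost in the formatted value, which is why a reader that only sees the partition value cannot reconstruct the original timestamp.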
[I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]
liangchen-datanerd opened a new issue, #11002:
URL: https://github.com/apache/hudi/issues/11002

**problem**
The requirement was to extract the date value from the event_time column and use it as the partition. According to the official Hudi docs, the ingestion config would look like this:
```
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
--hoodie-conf hoodie.keygen.timebased.timestamp.type="DATE_STRING" \
--hoodie-conf hoodie.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss" \
--hoodie-conf hoodie.keygen.timebased.output.dateformat="yyyy-MM-dd" \
```
The problem is that the partition value is correct, but when I query the table the partition column holds the partition value, not the original value. For example, if event_time is '2023-01-01 12:00:00', the partition value is 2023-01-01. When I query the Hudi table, event_time is 2023-01-01, not the original value. But when I read the parquet file directly, event_time is the original value.

**To Reproduce**
Steps to reproduce the behavior: using the pyspark shell.
```
pyspark \
--master spark://node1:7077 \
--packages 'org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469' \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
```
```
# Create a DataFrame
data = [("James", "Sales", "2023-01-02 12:12:23"),
        ("Michael", "Sales", "2023-01-01 12:12:23"),
        ("Robert", "Sales", "2023-01-02 01:12:23"),
        ("Maria", "Finance", "2023-01-01 01:15:23")]
df = spark.createDataFrame(data, ["employee_name", "department", "time"])

# Define Hudi options
hudi_options = {
    "hoodie.table.name": "employee_hudi",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.datasource.write.recordkey.field": "employee_name",
    "hoodie.datasource.write.partitionpath.field": "time:TIMESTAMP",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
    "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
    "hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd"
}

# Write DataFrame to Hudi
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save("s3a://hudi-warehouse/test/")

# Query the Hudi table
spark.read.format("hudi") \
    .option("hoodie.schema.on.read.enable", "true") \
    .load("s3a://hudi-warehouse/test/") \
    .show(truncate=False)

# Read the parquet file directly
spark.read.format("parquet") \
    .load("s3a://hudi-warehouse/test/2023-01-01/ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet") \
    .show(truncate=False)
```
When I query the Hudi table, the result is:
```
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
|20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
|20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
|20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
```
When I read the parquet file, the result is:
```