Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2087971292

   @liangchen-datanerd That's a good suggestion. Created a tracking JIRA too:
   https://issues.apache.org/jira/browse/HUDI-7698
   
   We can think of introducing a reader-side config that enables this. We have
   the original data in the parquet files, so it should not be a challenge.
   Feel free to contribute if you are interested. Thanks.
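   
   To make the proposal concrete, here is a minimal usage sketch. The option
   name below is hypothetical (no such config exists in Hudi today); it only
   illustrates what the reader-side switch tracked in HUDI-7698 might look
   like from PySpark.
   ```
   # Hypothetical sketch only: the option name below is made up to illustrate the
   # reader-side config proposed in HUDI-7698; it is not an existing Hudi config.
   df = (
       spark.read.format("hudi")
       # If such a switch existed, the reader would surface the partition column
       # exactly as persisted in the parquet base files (the original timestamp),
       # instead of the value derived from hoodie.keygen.timebased.output.dateformat.
       .option("hoodie.datasource.read.keep.original.partition.values", "true")
       .load("s3a://hudi-warehouse/test/")
   )
   df.select("employee_name", "time").show(truncate=False)
   ```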


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-17 Thread via GitHub


liangchen-datanerd commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2063009955

   @ad1happy2go 
   In the IoT scenario I have been working on, the event time is used as the
   partition column, but we query the data by the original timestamp event time,
   not by the transformed partition path. Implementing this feature would be a
   great help. Should I close this issue or leave it open?
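   
   Until such a reader-side option exists, one possible workaround (a sketch,
   not code from this thread) is to partition on a separate derived column so
   the original time column is stored and returned untouched. The column name
   "event_date", the trimmed-down option set, and the target path below are
   illustrative assumptions.
   ```
   from pyspark.sql import functions as F

   # Workaround sketch: derive a dedicated partition column from the event time
   # and partition on that, leaving the original "time" column unmodified.
   df_with_date = df.withColumn("event_date", F.date_format("time", "yyyy-MM-dd"))

   workaround_options = {
       "hoodie.table.name": "employee_hudi",
       "hoodie.datasource.write.recordkey.field": "employee_name",
       # Partition on the derived column instead of transforming "time" itself.
       "hoodie.datasource.write.partitionpath.field": "event_date",
   }

   df_with_date.write.format("hudi") \
       .options(**workaround_options) \
       .mode("overwrite") \
       .save("s3a://hudi-warehouse/test_workaround/")
   ```
   The trade-off is that the date is stored twice (in event_date and inside
   time), but queries on the original timestamp keep working while event_date
   still drives partition pruning.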





Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-16 Thread via GitHub


ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2059291917

   @liangchen-datanerd Thanks, got it. So currently this is not implemented:
   the partition column is transformed after it is read from parquet. Will check
   if we can prioritise this one, although it could also regress other
   functionality, so we need to check more.
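   
   For a concrete picture of that transform, here is a standalone illustration
   (plain Python, not Hudi's actual code path) of how the partition value is
   derived from the event time with this thread's input/output formats. The
   Java patterns yyyy-MM-dd HH:mm:ss and yyyy-MM-dd are expressed with their
   Python strptime/strftime equivalents.
   ```
   from datetime import datetime

   # Python equivalents of the Java patterns "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd"
   # used by hoodie.keygen.timebased.input.dateformat / output.dateformat.
   INPUT_FORMAT = "%Y-%m-%d %H:%M:%S"
   OUTPUT_FORMAT = "%Y-%m-%d"

   def partition_value(event_time: str) -> str:
       # Parse the original event time, then re-format it to the partition value.
       return datetime.strptime(event_time, INPUT_FORMAT).strftime(OUTPUT_FORMAT)

   # "2023-01-01 12:12:23" -> "2023-01-01"; per the discussion, this date string
   # is also what a Hudi query currently returns for the partition column.
   print(partition_value("2023-01-01 12:12:23"))
   ```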





Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-14 Thread via GitHub


liangchen-datanerd commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2054533892

   @ad1happy2go Thanks for the reply. Hudi does transform the partition column's
   timestamp value into a date value based on the
   hoodie.keygen.timebased.output.dateformat=yyyy-MM-dd config. At the same time,
   the original timestamp value cannot be retrieved through Spark even though it
   is persisted in the parquet file. Since Hudi already has _hoodie_partition_path
   to indicate the partition path, why not keep the original data in the partition
   column? For example, when I query the Hudi table, I expect the time column to
   be a timestamp value. How can I retrieve the original timestamp value for the
   time column?
   
   This is the result of the Hudi table query I mentioned:
   ```
   
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
   |20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   ```
   If I have not illustrated this issue well, the ticket
   [HUDI-3204](https://issues.apache.org/jira/browse/HUDI-3204) describes a
   similar issue. Thanks 
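   
   One way to inspect the original values today is to read a partition's base
   parquet files directly, bypassing the Hudi datasource. A sketch follows,
   using the partition directory shown above; with multiple commits this may
   also surface older file slices, so it is for inspection only.
   ```
   # Inspection-only sketch: read the raw base files of one partition directly,
   # bypassing the Hudi relation, to see the original "time" values persisted in
   # parquet. The path matches the partition shown above; with several commits
   # this can include older file slices, so it is not a substitute for a Hudi query.
   spark.read.parquet("s3a://hudi-warehouse/test/2023-01-01/") \
       .select("employee_name", "department", "time") \
       .show(truncate=False)
   ```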





Re: [I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-14 Thread via GitHub


ad1happy2go commented on issue #11002:
URL: https://github.com/apache/hudi/issues/11002#issuecomment-2054100970

   @liangchen-datanerd As you are using
   "hoodie.keygen.timebased.output.dateformat":"yyyy-MM-dd", it is expected for
   Hudi to output the partition column value in date format only, not as a
   timestamp. Why do you think this is a problem?
   
   Can you elaborate on the issue you see here?





[I] [SUPPORT] can't retrieve original partition column value when exacting date with CustomKeyGenerator [hudi]

2024-04-11 Thread via GitHub


liangchen-datanerd opened a new issue, #11002:
URL: https://github.com/apache/hudi/issues/11002

   
   **Problem**
   
   The requirement was to extract a date value from the event_time column and use
   it as the partition. According to the official Hudi docs, the ingestion config
   would look like this:
   ```
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
   --hoodie-conf hoodie.keygen.timebased.timestamp.type="DATE_STRING" \
   --hoodie-conf hoodie.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss" \
   --hoodie-conf hoodie.keygen.timebased.output.dateformat="yyyy-MM-dd" \
   ```
   The problem is that the partition value was correct, but when I query the
   table, the partition column contains the partition value, not the original
   value. For example, if event_time is '2023-01-01 12:00:00', the partition
   value is 2023-01-01. When I query the Hudi table, event_time comes back as
   2023-01-01, not the original value; but when I query the parquet file
   directly, event_time is the original value. 
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:   
   Start a PySpark shell. 
   ```
   pyspark \
   --master spark://node1:7077 \
   --packages 'org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469' \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
   --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
   --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
   ```
   
   ```
   # Create a DataFrame
   data = [("James", "Sales", "2023-01-02 12:12:23"),
           ("Michael", "Sales", "2023-01-01 12:12:23"),
           ("Robert", "Sales", "2023-01-02 01:12:23"),
           ("Maria", "Finance", "2023-01-01 01:15:23")]
   df = spark.createDataFrame(data, ["employee_name", "department", "time"])

   # Define Hudi options
   hudi_options = {
       "hoodie.table.name": "employee_hudi",
       "hoodie.datasource.write.operation": "insert_overwrite_table",
       "hoodie.datasource.write.recordkey.field": "employee_name",
       "hoodie.datasource.write.partitionpath.field": "time:TIMESTAMP",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
       "hoodie.keygen.timebased.timestamp.type": "DATE_STRING",
       "hoodie.keygen.timebased.input.dateformat": "yyyy-MM-dd HH:mm:ss",
       "hoodie.keygen.timebased.output.dateformat": "yyyy-MM-dd"
   }

   # Write DataFrame to Hudi
   df.write.format("hudi") \
       .options(**hudi_options) \
       .mode("overwrite") \
       .save("s3a://hudi-warehouse/test/")

   # Query the Hudi table
   spark.read.format("hudi") \
       .option("hoodie.schema.on.read.enable", "true") \
       .load("s3a://hudi-warehouse/test/") \
       .show(truncate=False)

   # Read the parquet file directly
   spark.read.format("parquet") \
       .load("s3a://hudi-warehouse/test/2023-01-01/ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet") \
       .show(truncate=False)
   ```
   
   When I query the Hudi table, the result is:
   ```
   
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |employee_name|department|time      |
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   |20240411142532923  |20240411142532923_1_0|James             |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|James        |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_1_1|Robert            |2023-01-02            |ea678686-d3d3-4555-b894-30ecb1da2a47-0_1-134-190_20240411142532923.parquet|Robert       |Sales     |2023-01-02|
   |20240411142532923  |20240411142532923_0_0|Michael           |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Michael      |Sales     |2023-01-01|
   |20240411142532923  |20240411142532923_0_1|Maria             |2023-01-01            |ec109c4b-723f-46ce-8bb2-5d1e57ecc204-0_0-134-191_20240411142532923.parquet|Maria        |Finance   |2023-01-01|
   +-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+-------------+----------+----------+
   ```
   When I read the parquet file directly, the result is:  
   ```