[jira] [Assigned] (HUDI-5948) Apply Maven CI-friendly version

2023-03-17 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5948:


Assignee: Yann Byron

> Apply Maven CI-friendly version
> ---
>
> Key: HUDI-5948
> URL: https://issues.apache.org/jira/browse/HUDI-5948
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> Apply the Maven CI-friendly version mechanism to simplify version management.
> Afterwards, the versions of all modules can be changed with a single edit.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5948) Apply Maven CI-friendly version

2023-03-17 Thread Yann Byron (Jira)
Yann Byron created HUDI-5948:


 Summary: Apply Maven CI-friendly version
 Key: HUDI-5948
 URL: https://issues.apache.org/jira/browse/HUDI-5948
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron


Apply the Maven CI-friendly version mechanism to simplify version management.

Afterwards, the versions of all modules can be changed with a single edit.
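For illustration, a minimal sketch of what the CI-friendly setup could look like, using the standard Maven `${revision}` property together with the flatten-maven-plugin (the coordinates and version values below are placeholders, not Hudi's actual POM):

{code:xml}
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi</artifactId>
  <packaging>pom</packaging>
  <!-- the single place the version is defined; child modules inherit it -->
  <version>${revision}</version>

  <properties>
    <revision>0.14.0-SNAPSHOT</revision>
  </properties>

  <build>
    <plugins>
      <!-- resolves ${revision} in the POMs that get installed/deployed -->
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>flatten-maven-plugin</artifactId>
        <configuration>
          <updatePomFile>true</updatePomFile>
          <flattenMode>resolveCiFriendliesOnly</flattenMode>
        </configuration>
        <executions>
          <execution>
            <id>flatten</id>
            <phase>process-resources</phase>
            <goals>
              <goal>flatten</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
{code}

A release can then set every module's version from one spot, e.g. `mvn -Drevision=0.14.0 deploy`.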
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5702) Avoid writing useless CDC data during compaction

2023-02-04 Thread Yann Byron (Jira)
Yann Byron created HUDI-5702:


 Summary: Avoid writing useless CDC data during compaction
 Key: HUDI-5702
 URL: https://issues.apache.org/jira/browse/HUDI-5702
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5701) Fail to compact MOR table when CDC is enabled

2023-02-04 Thread Yann Byron (Jira)
Yann Byron created HUDI-5701:


 Summary: Fail to compact MOR table when CDC is enabled
 Key: HUDI-5701
 URL: https://issues.apache.org/jira/browse/HUDI-5701
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: Yann Byron


https://github.com/apache/hudi/issues/7822



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5634) Improve cdc-related code

2023-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5634:


Assignee: Yann Byron

> Improve cdc-related code
> -
>
> Key: HUDI-5634
> URL: https://issues.apache.org/jira/browse/HUDI-5634
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> This ticket addresses some review comments left on
> https://github.com/apache/hudi/pull/6727.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5634) Improve cdc-related code

2023-01-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-5634:


 Summary: Improve cdc-related code
 Key: HUDI-5634
 URL: https://issues.apache.org/jira/browse/HUDI-5634
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron


This ticket addresses some review comments left on
https://github.com/apache/hudi/pull/6727.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5629) Clean cdc log files when disable cdc

2023-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5629:


Assignee: Yann Byron

> Clean cdc log files when disable cdc
> 
>
> Key: HUDI-5629
> URL: https://issues.apache.org/jira/browse/HUDI-5629
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> According to the current clean logic for cdc, cdc log files are cleaned only
> when the cdc config is enabled. But if a table enables cdc first and then
> disables it, some cdc log files will probably be left behind that can't be
> cleaned. This PR will fix that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5629) Clean cdc log files when disable cdc

2023-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-5629:
-
Reviewers: Raymond Xu

> Clean cdc log files when disable cdc
> 
>
> Key: HUDI-5629
> URL: https://issues.apache.org/jira/browse/HUDI-5629
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> According to the current clean logic for cdc, cdc log files are cleaned only
> when the cdc config is enabled. But if a table enables cdc first and then
> disables it, some cdc log files will probably be left behind that can't be
> cleaned. This PR will fix that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5629) Clean cdc log files when disable cdc

2023-01-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-5629:


 Summary: Clean cdc log files when disable cdc
 Key: HUDI-5629
 URL: https://issues.apache.org/jira/browse/HUDI-5629
 Project: Apache Hudi
  Issue Type: Improvement
  Components: cleaning
Reporter: Yann Byron


According to the current clean logic for cdc, cdc log files are cleaned only when
the cdc config is enabled. But if a table enables cdc first and then disables it,
some cdc log files will probably be left behind that can't be cleaned.

This PR will fix that.
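A minimal sketch of the enable-then-disable sequence that strands cdc log files (assuming the `hoodie.table.cdc.enabled` flag from the cdc feature; `df1`, `df2` and `basePath` are placeholders):

{code:scala}
// first writes: cdc enabled, so cdc log files are produced alongside the data
df1.write.format("hudi").
  option("hoodie.table.name", "tbl").
  option("hoodie.table.cdc.enabled", "true").
  mode("append").save(basePath)

// later writes: cdc disabled. With the current logic the cleaner only
// considers cdc log files while this flag is on, so the files written
// above are never removed.
df2.write.format("hudi").
  option("hoodie.table.name", "tbl").
  option("hoodie.table.cdc.enabled", "false").
  mode("append").save(basePath)
{code}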



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5340) Spark SQL supports Table-Valued Function to extend more query syntax

2023-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5340:


Assignee: Yann Byron

> Spark SQL supports Table-Valued Function to extend more query syntax
> 
>
> Key: HUDI-5340
> URL: https://issues.apache.org/jira/browse/HUDI-5340
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5340) Spark SQL supports Table-Valued Function to extend more query syntax

2022-12-06 Thread Yann Byron (Jira)
Yann Byron created HUDI-5340:


 Summary: Spark SQL supports Table-Valued Function to extend more 
query syntax
 Key: HUDI-5340
 URL: https://issues.apache.org/jira/browse/HUDI-5340
 Project: Apache Hudi
  Issue Type: New Feature
  Components: spark-sql
Reporter: Yann Byron
 Fix For: 0.13.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5279) move logic for deleting active instant to HoodieActiveTimeline

2022-11-25 Thread Yann Byron (Jira)
Yann Byron created HUDI-5279:


 Summary: move logic for deleting active instant to 
HoodieActiveTimeline
 Key: HUDI-5279
 URL: https://issues.apache.org/jira/browse/HUDI-5279
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5241) Optimize HoodieDefaultTimeline API

2022-11-17 Thread Yann Byron (Jira)
Yann Byron created HUDI-5241:


 Summary: Optimize HoodieDefaultTimeline API
 Key: HUDI-5241
 URL: https://issues.apache.org/jira/browse/HUDI-5241
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5082) Improve the cdc log file name format

2022-10-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-5082:


Assignee: Yann Byron

> Improve the cdc log file name format
> 
>
> Key: HUDI-5082
> URL: https://issues.apache.org/jira/browse/HUDI-5082
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5082) Improve the cdc log file name format

2022-10-23 Thread Yann Byron (Jira)
Yann Byron created HUDI-5082:


 Summary: Improve the cdc log file name format
 Key: HUDI-5082
 URL: https://issues.apache.org/jira/browse/HUDI-5082
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4949) Optimize cdc read to avoid problems caused by reusing the buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron resolved HUDI-4949.
--

> Optimize cdc read to avoid problems caused by reusing the buffer underlying
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4949) Optimize cdc read to avoid problems caused by reusing the buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4949:
-
Fix Version/s: 0.13.0

> Optimize cdc read to avoid problems caused by reusing the buffer underlying
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4949) Optimize cdc read to avoid problems caused by reusing the buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4949:


Assignee: Yann Byron

> Optimize cdc read to avoid problems caused by reusing the buffer underlying
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4915) Spark Avro SerDe returns wrong result upon multiple calls

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-4915.

Resolution: Won't Fix

> Spark Avro SerDe returns wrong result upon multiple calls
> -
>
> Key: HUDI-4915
> URL: https://issues.apache.org/jira/browse/HUDI-4915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, the Spark Avro serializer/deserializer has a bug: it returns the
> same object when the method is called twice in a row. For example:
> val row1: InternalRow = ...
> val row2: InternalRow = ... // row2 is different from row1
>  
> val serializedRecord1 = serialize(row1)
> val serializedRecord2 = serialize(row2)
> serializedRecord1.equals(serializedRecord2) // unexpectedly true
>  
> That is because we use `val` to declare the serializer/deserializer
> functions, so the later result overwrites the previous one.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4949) Optimize cdc read to avoid problems caused by reusing the buffer underlying the Row

2022-09-28 Thread Yann Byron (Jira)
Yann Byron created HUDI-4949:


 Summary: Optimize cdc read to avoid problems caused by
reusing the buffer underlying the Row
 Key: HUDI-4949
 URL: https://issues.apache.org/jira/browse/HUDI-4949
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4948) Support flush and rollover for CDC Write

2022-09-28 Thread Yann Byron (Jira)
Yann Byron created HUDI-4948:


 Summary: Support flush and rollover for CDC Write
 Key: HUDI-4948
 URL: https://issues.apache.org/jira/browse/HUDI-4948
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core, spark, writer-core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4915) Improve Spark Avro SerDe

2022-09-25 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4915:
-
Description: 
Currently, the Spark Avro serializer/deserializer has a bug: it returns the same
object when the method is called twice in a row. For example:

val row1: InternalRow = ...

val row2: InternalRow = ... // row2 is different from row1

val serializedRecord1 = serialize(row1)

val serializedRecord2 = serialize(row2)

serializedRecord1.equals(serializedRecord2) // unexpectedly true

That is because we use `val` to declare the serializer/deserializer
functions, so the later result overwrites the previous one.
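To make the failure mode concrete, here is a minimal standalone sketch of the pitfall (not Hudi's actual classes; just a converter declared as a `val` that reuses one mutable holder):

{code:scala}
object ValReusePitfall extends App {
  // the converter captures a single shared, mutable holder
  val holder = new StringBuilder
  val serialize: String => StringBuilder = { s =>
    holder.clear()
    holder.append(s)
    holder // every call returns the same instance
  }

  val r1 = serialize("row1")
  val r2 = serialize("row2")
  assert(r1 eq r2)  // same object: the first result was silently overwritten
  println(r1)       // prints "row2"
}
{code}

The fix is to give each call its own output object, or to copy the result before the buffer is reused.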

 

 

> Improve Spark Avro SerDe
> 
>
> Key: HUDI-4915
> URL: https://issues.apache.org/jira/browse/HUDI-4915
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the Spark Avro serializer/deserializer has a bug: it returns the
> same object when the method is called twice in a row. For example:
> val row1: InternalRow = ...
> val row2: InternalRow = ... // row2 is different from row1
>  
> val serializedRecord1 = serialize(row1)
> val serializedRecord2 = serialize(row2)
> serializedRecord1.equals(serializedRecord2) // unexpectedly true
>  
> That is because we use `val` to declare the serializer/deserializer
> functions, so the later result overwrites the previous one.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4915) Improve Spark Avro SerDe

2022-09-25 Thread Yann Byron (Jira)
Yann Byron created HUDI-4915:


 Summary: Improve Spark Avro SerDe
 Key: HUDI-4915
 URL: https://issues.apache.org/jira/browse/HUDI-4915
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4887) Use Avro as the persisted cdc data format instead of Json

2022-09-20 Thread Yann Byron (Jira)
Yann Byron created HUDI-4887:


 Summary: Use Avro as the persisted cdc data format instead of Json
 Key: HUDI-4887
 URL: https://issues.apache.org/jira/browse/HUDI-4887
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron
Assignee: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4822) Extract the baseFile and logFiles from HoodieDeltaWriteStat in the right way

2022-09-09 Thread Yann Byron (Jira)
Yann Byron created HUDI-4822:


 Summary: Extract the baseFile and logFiles from
HoodieDeltaWriteStat in the right way
 Key: HUDI-4822
 URL: https://issues.apache.org/jira/browse/HUDI-4822
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron


Currently, we can't get the `baseFile` and `logFiles` members from
`HoodieDeltaWriteStat` directly, because the related information is lost after
deserialization from the commit files. We need to improve this.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4703) use the corresponding schema (not the latest schema) to respond to the time travel query

2022-09-06 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron resolved HUDI-4703.
--

> use the corresponding schema (not the latest schema) to respond to the time
> travel query
> ---
>
> Key: HUDI-4703
> URL: https://issues.apache.org/jira/browse/HUDI-4703
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/6424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4703) use the corresponding schema (not the latest schema) to respond to the time travel query

2022-09-06 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4703:
-
Status: In Progress  (was: Open)

> use the corresponding schema (not the latest schema) to respond to the time
> travel query
> ---
>
> Key: HUDI-4703
> URL: https://issues.apache.org/jira/browse/HUDI-4703
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/6424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4703) use the corresponding schema (not the latest schema) to respond to the time travel query

2022-09-06 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4703:
-
Status: Patch Available  (was: In Progress)

> use the corresponding schema (not the latest schema) to respond to the time
> travel query
> ---
>
> Key: HUDI-4703
> URL: https://issues.apache.org/jira/browse/HUDI-4703
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/6424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4703) use the corresponding schema (not the latest schema) to respond to the time travel query

2022-09-06 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4703:


Assignee: Yann Byron

> use the corresponding schema (not the latest schema) to respond to the time
> travel query
> ---
>
> Key: HUDI-4703
> URL: https://issues.apache.org/jira/browse/HUDI-4703
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/6424



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4705) Support Write-on-compaction mode when querying cdc on MOR tables

2022-08-24 Thread Yann Byron (Jira)
Yann Byron created HUDI-4705:


 Summary: Support Write-on-compaction mode when querying cdc on MOR
tables
 Key: HUDI-4705
 URL: https://issues.apache.org/jira/browse/HUDI-4705
 Project: Apache Hudi
  Issue Type: New Feature
  Components: compaction, spark
Reporter: Yann Byron


For the case of querying cdc on MOR tables, the initial implementation uses the
`Write-on-indexing` way, extracting the cdc data by merging the base file and
log files in flight.

This ticket wants to support the `Write-on-compaction` way, which gets the cdc
data just by reading the persisted cdc files written during the compaction
operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4703) use the corresponding schema (not the latest schema) to respond to the time travel query

2022-08-23 Thread Yann Byron (Jira)
Yann Byron created HUDI-4703:


 Summary: use the corresponding schema (not the latest schema) to
respond to the time travel query
 Key: HUDI-4703
 URL: https://issues.apache.org/jira/browse/HUDI-4703
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Reporter: Yann Byron


https://github.com/apache/hudi/issues/6424
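For context, a time travel read looks like the sketch below (`as.of.instant` is Hudi's documented time-travel option; the instant value and `basePath` are placeholders). The fix is to answer such a query with the table schema as of that instant, not the latest schema:

{code:scala}
// read the table as it was at a past instant (placeholder value)
val df = spark.read.format("hudi").
  option("as.of.instant", "20220801120000").
  load(basePath)
{code}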



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-3690) use all the incoming records to update the existing ones

2022-08-04 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3690.

Resolution: Duplicate

> use all the incoming records to update the existing ones
> -
>
> Key: HUDI-3690
> URL: https://issues.apache.org/jira/browse/HUDI-3690
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, writer-core
>Reporter: Yann Byron
>Priority: Major
>
> https://github.com/apache/hudi/issues/5000#issuecomment-1075191819



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-3690) use all the incoming records to update the existing ones

2022-08-04 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575312#comment-17575312
 ] 

Yann Byron commented on HUDI-3690:
--

[~Pratyaksh] Yes. I'll close this and link to that one.

> use all the incoming records to update the existing ones
> -
>
> Key: HUDI-3690
> URL: https://issues.apache.org/jira/browse/HUDI-3690
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, writer-core
>Reporter: Yann Byron
>Priority: Major
>
> https://github.com/apache/hudi/issues/5000#issuecomment-1075191819



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4514) optimize CTAS or saveAsTable in different modes

2022-08-01 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4514:


Assignee: Yann Byron

> optimize CTAS or saveAsTable in different modes
> ---
>
> Key: HUDI-4514
> URL: https://issues.apache.org/jira/browse/HUDI-4514
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>
> https://github.com/apache/hudi/issues/5904



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4514) optimize CTAS or saveAsTable in different modes

2022-08-01 Thread Yann Byron (Jira)
Yann Byron created HUDI-4514:


 Summary: optimize CTAS or saveAsTable in different modes
 Key: HUDI-4514
 URL: https://issues.apache.org/jira/browse/HUDI-4514
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: Yann Byron


https://github.com/apache/hudi/issues/5904



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4503) support for parsing identifier with catalog

2022-07-29 Thread Yann Byron (Jira)
Yann Byron created HUDI-4503:


 Summary: support for parsing identifier with catalog
 Key: HUDI-4503
 URL: https://issues.apache.org/jira/browse/HUDI-4503
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, spark-sql
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4503) support for parsing identifier with catalog

2022-07-29 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4503:


Assignee: Yann Byron

> support for parsing identifier with catalog
> ---
>
> Key: HUDI-4503
> URL: https://issues.apache.org/jira/browse/HUDI-4503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4494) keep the fields' order when data is written out of order

2022-07-28 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4494:
-
Fix Version/s: 0.12.0

>  keep the fields' order when data is written out of order
> -
>
> Key: HUDI-4494
> URL: https://issues.apache.org/jira/browse/HUDI-4494
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4487) support to create ro/rt table by spark sql

2022-07-28 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4487:


Assignee: Yann Byron

> support creating ro/rt tables by spark sql
> --
>
> Key: HUDI-4487
> URL: https://issues.apache.org/jira/browse/HUDI-4487
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, if the ro/rt table is missing, users can only create it through the
> hudi cli, providing the full schema and properties like the sql below, because
> executing the create-table sql in spark sql gets converted into a table
> rename, which is not expected; see:
> [https://github.com/apache/hudi/issues/6004.] 
>  
> {code:java}
> CREATE EXTERNAL TABLE `mor_tbl1_ro`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `name` string,
>   `ts` bigint)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='/path/to//mor_tbl1',
>   'hoodie.query.as.ro.table'='true')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   '/path/to//mor_tbl1'
> TBLPROPERTIES (
>   'preCombineField'='ts',
>   'primaryKey'='id',
>   'spark.sql.create.version'='3.1.2',
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
>   'transient_lastDdlTime'='1658905080',
>   'type'='mor'
> ); {code}
>  
>  
> I think hudi can support a simplified way to create the ro/rt table in
> spark-sql in the right way.
> {code:java}
> create EXTERNAL table `mor_tbl1_rt` 
> using hudi
> options(`hoodie.query.as.ro.table` = 'false')
> location '/path/to//mor_tbl1';{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4487) support creating ro/rt tables by spark sql

2022-07-28 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4487:
-
Fix Version/s: 0.12.0

> support creating ro/rt tables by spark sql
> --
>
> Key: HUDI-4487
> URL: https://issues.apache.org/jira/browse/HUDI-4487
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, if the ro/rt table is missing, users can only create it through the
> hudi cli, providing the full schema and properties like the sql below, because
> executing the create-table sql in spark sql gets converted into a table
> rename, which is not expected; see:
> [https://github.com/apache/hudi/issues/6004.] 
>  
> {code:java}
> CREATE EXTERNAL TABLE `mor_tbl1_ro`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `name` string,
>   `ts` bigint)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='/path/to//mor_tbl1',
>   'hoodie.query.as.ro.table'='true')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   '/path/to//mor_tbl1'
> TBLPROPERTIES (
>   'preCombineField'='ts',
>   'primaryKey'='id',
>   'spark.sql.create.version'='3.1.2',
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
>   'transient_lastDdlTime'='1658905080',
>   'type'='mor'
> ); {code}
>  
>  
> I think hudi can support a simplified way to create the ro/rt table in
> spark-sql in the right way.
> {code:java}
> create EXTERNAL table `mor_tbl1_rt` 
> using hudi
> options(`hoodie.query.as.ro.table` = 'false')
> location '/path/to//mor_tbl1';{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4494) keep the fields' order when data is written out of order

2022-07-28 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4494:
-
Summary:  keep the fields' order when data is written out of order  (was: 
Sort data by the schema order when calling "Insert")

>  keep the fields' order when data is written out of order
> -
>
> Key: HUDI-4494
> URL: https://issues.apache.org/jira/browse/HUDI-4494
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4494) Sort data by the schema order when calling "Insert"

2022-07-28 Thread Yann Byron (Jira)
Yann Byron created HUDI-4494:


 Summary: Sort data by the schema order when calling "Insert"
 Key: HUDI-4494
 URL: https://issues.apache.org/jira/browse/HUDI-4494
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4487) support creating ro/rt tables by spark sql

2022-07-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4487:
-
Description: 
Currently, if the ro/rt table is missing, users can only create it through the
hudi cli, providing the full schema and properties like the sql below, because
executing the create-table sql in spark sql gets converted into a table rename,
which is not expected; see: [https://github.com/apache/hudi/issues/6004.] 

 
{code:java}
CREATE EXTERNAL TABLE `mor_tbl1_ro`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/path/to//mor_tbl1',
  'hoodie.query.as.ro.table'='true')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/path/to//mor_tbl1'
TBLPROPERTIES (
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.1.2',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1658905080',
  'type'='mor'
); {code}
 

 

I think hudi can support a simplified way to create the ro/rt table in spark-sql
in the right way.
{code:java}
create EXTERNAL table `mor_tbl1_rt` 
using hudi
options(`hoodie.query.as.ro.table` = 'false')
location '/path/to//mor_tbl1';{code}

  was:
Currently, if the ro/rt table is missing, users can only create it through the
hudi cli, providing the full schema and properties like the sql below, because
executing the create-table sql in spark sql gets converted into a table rename,
which is not expected; see: [https://github.com/apache/hudi/issues/6004.] 

 
{code:java}
CREATE EXTERNAL TABLE `mor_tbl1_ro`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/path/to//mor_tbl1',
  'hoodie.query.as.ro.table'='true')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/path/to//mor_tbl1'
TBLPROPERTIES (
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.1.2',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1658905080',
  'type'='mor'
); {code}
 

 

I think hudi can support a simplified way to create the ro/rt table in spark-sql
in the right way.
{code:java}
{code}


> support creating ro/rt tables by spark sql
> --
>
> Key: HUDI-4487
> URL: https://issues.apache.org/jira/browse/HUDI-4487
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
>
> Currently, if the ro/rt table is missing, users can only create it through the
> hudi cli, providing the full schema and properties like the sql below, because
> executing the create-table sql in spark sql gets converted into a table
> rename, which is not expected; see:
> [https://github.com/apache/hudi/issues/6004.] 
>  
> {code:java}
> CREATE EXTERNAL TABLE

[jira] [Updated] (HUDI-4487) support creating ro/rt tables by spark sql

2022-07-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4487:
-
Description: 
Currently, if the ro/rt table is missing, users can only create it through the
hudi cli, providing the full schema and properties like the sql below, because
executing the create-table sql in spark sql gets converted into a table rename,
which is not expected; see: [https://github.com/apache/hudi/issues/6004.] 

 
{code:java}
CREATE EXTERNAL TABLE `mor_tbl1_ro`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/path/to//mor_tbl1',
  'hoodie.query.as.ro.table'='true')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/path/to//mor_tbl1'
TBLPROPERTIES (
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.1.2',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1658905080',
  'type'='mor'
); {code}
 

 

I think hudi can support a simplified way to create the ro/rt table in spark-sql
in the right way.
{code:java}
{code}

  was:
Currently, if the ro/rt table is missing, users can only create it through the
hudi cli, providing the full schema and properties like the sql below, because
executing the create-table sql in spark sql gets converted into a table rename,
which is not expected; see: [https://github.com/apache/hudi/issues/6004.] 

```

CREATE EXTERNAL TABLE `mor_tbl1_ro`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/path/to//mor_tbl1',
  'hoodie.query.as.ro.table'='true')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/path/to//mor_tbl1'
TBLPROPERTIES (
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.1.2',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  
'spark.sql.sources.schema.part.0'='\{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1658905080',
  'type'='mor'
);

```

 

I think hudi can support a simplified way to create the ro/rt table in spark-sql
in the right way.


> support creating ro/rt tables by spark sql
> --
>
> Key: HUDI-4487
> URL: https://issues.apache.org/jira/browse/HUDI-4487
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
>
> Currently, if the ro/rt table is missing, users can only create it through the
> hudi cli, providing the full schema and properties like the sql below, because
> executing the create-table sql in spark sql gets converted into a table
> rename, which is not expected; see:
> [https://github.com/apache/hudi/issues/6004.] 
>  
> {code:java}
> CREATE EXTERNAL TABLE `mor_tbl1_ro`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,

[jira] [Created] (HUDI-4487) support creating ro/rt tables by spark sql

2022-07-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-4487:


 Summary: support creating ro/rt tables by spark sql
 Key: HUDI-4487
 URL: https://issues.apache.org/jira/browse/HUDI-4487
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: Yann Byron


Currently, if the ro/rt table is missing, users can only create it through the
hudi cli, providing the full schema and properties like the sql below, because
executing the create-table sql in spark sql gets converted into a table rename,
which is not expected; see: [https://github.com/apache/hudi/issues/6004.] 

```

CREATE EXTERNAL TABLE `mor_tbl1_ro`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `id` int,
  `name` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='/path/to//mor_tbl1',
  'hoodie.query.as.ro.table'='true')
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/path/to//mor_tbl1'
TBLPROPERTIES (
  'preCombineField'='ts',
  'primaryKey'='id',
  'spark.sql.create.version'='3.1.2',
  'spark.sql.sources.provider'='hudi',
  'spark.sql.sources.schema.numParts'='1',
  
'spark.sql.sources.schema.part.0'='\{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":true,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"ts","type":"long","nullable":true,"metadata":{}}]}',
  'transient_lastDdlTime'='1658905080',
  'type'='mor'
);

```

 

I think hudi can support a simplified way to create the ro/rt table in spark-sql
in the right way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4486) validate the incoming configs and table name when creating an external table

2022-07-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-4486:


 Summary: validate the incoming configs and table name when creating an
external table
 Key: HUDI-4486
 URL: https://issues.apache.org/jira/browse/HUDI-4486
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-1329) Support async compaction in spark DF write()

2022-04-20 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-1329:


Assignee: (was: Yann Byron)

> Support async compaction in spark DF write()
> 
>
> Key: HUDI-1329
> URL: https://issues.apache.org/jira/browse/HUDI-1329
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction, spark, table-service
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.0
>
>
> Support `spark.write().format("hudi").option(operation, "run_compact")` to
> run compaction.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3690) use all the incoming records to update the existing ones

2022-03-22 Thread Yann Byron (Jira)
Yann Byron created HUDI-3690:


 Summary: use all the incoming records to update the existing ones
 Key: HUDI-3690
 URL: https://issues.apache.org/jira/browse/HUDI-3690
 Project: Apache Hudi
  Issue Type: New Feature
  Components: spark, writer-core
Reporter: Yann Byron


https://github.com/apache/hudi/issues/5000#issuecomment-1075191819



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3232) support reload timeline Incrementally

2022-03-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3232:


Assignee: (was: Yann Byron)

> support reload timeline Incrementally
> -
>
> Key: HUDI-3232
> URL: https://issues.apache.org/jira/browse/HUDI-3232
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, incremental-query, writer-core
>Reporter: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Recently, `HoodieTableMetaClient.reloadActiveTimeline` has been called many
> times in one operation, and each call reloads the timeline fully.
> Supporting an incremental reload mode would probably improve
> performance.
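> A sketch of the pattern in question (`HoodieTableMetaClient` and
> `reloadActiveTimeline` are the real Hudi names; the surrounding code is
> illustrative):
> {code:scala}
> val metaClient = HoodieTableMetaClient.builder()
>   .setConf(hadoopConf).setBasePath(basePath).build()
> metaClient.reloadActiveTimeline() // full reload from storage
> // ... next step of the same operation ...
> metaClient.reloadActiveTimeline() // full reload again; an incremental mode
> // would only pick up instants added since the previous reload
> {code}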



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3232) support reload timeline Incrementally

2022-03-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3232:


Assignee: Yann Byron

> support reload timeline Incrementally
> -
>
> Key: HUDI-3232
> URL: https://issues.apache.org/jira/browse/HUDI-3232
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, incremental-query, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Recently, `HoodieTableMetaClient.reloadActiveTimeline` has been called many
> times in one operation, and each call reloads the timeline fully.
> Supporting an incremental reload mode would probably improve
> performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3232) support reload timeline Incrementally

2022-03-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3232:
-
Labels:   (was: pull-request-available)

> support reload timeline Incrementally
> -
>
> Key: HUDI-3232
> URL: https://issues.apache.org/jira/browse/HUDI-3232
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, incremental-query, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Recently, `HoodieTableMetaClient.reloadActiveTimeline` has been called many
> times in one operation, and each call reloads the timeline fully.
> Supporting an incremental reload mode would probably improve
> performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-1689) Support Multipath query for HoodieFileIndex

2022-03-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-1689.

Resolution: Won't Do

> Support Multipath query for HoodieFileIndex
> ---
>
> Key: HUDI-1689
> URL: https://issues.apache.org/jira/browse/HUDI-1689
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Major
>
> Support multipath queries for the HoodieFileIndex to benefit from partition
> pruning.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3214) Optimize auto partition in spark

2022-02-28 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499274#comment-17499274
 ] 

Yann Byron commented on HUDI-3214:
--

[~xushiyan] [~shivnarayan] I think no new configs or key generator are needed
here. I plan to enable `hoodie.datasource.write.partitionpath.urlencode` and
`hoodie.datasource.write.hive_style_partitioning` by default. And if users want
to auto-discover partitions from the partition path, they can disable
`hoodie.datasource.write.partitionpath.urlencode`.

> Optimize auto partition in spark
> 
>
> Key: HUDI-3214
> URL: https://issues.apache.org/jira/browse/HUDI-3214
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Recently, if a partition's value has a format like
> "pt1=/pt2=/pt3=", split by slashes, Hudi will partition
> automatically, and the table's directory will have a multi-level partition
> structure.
> I think that's unpredictable, so this umbrella task is created to optimize auto
> partition and make the behavior more reasonable.
> Also, in hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but not in 0.9+.
> There are a few sub-tasks:
>  * add a flag to control whether auto-partition is enabled, to make the default
> behavior reasonable.
>  * implement a new key generator designed specifically for this scenario.
>  * fix the bug where the schema differs depending on whether
> *hoodie.file.index.enable* is enabled in this case.
>  
> Test code: 
> {code:java}
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> import org.apache.spark.sql.functions._ // needed for regexp_replace below
> import spark.implicits._ // needed for the $"partitionpath" column syntax
> val tableName = "hudi_trips_cow"
> val basePath = "file:///tmp/hudi_trips_cow"
> val dataGen = new DataGenerator
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath", 
> "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))
> newDf.write.format("hudi").
> options(getQuickstartWriteConfigs).
> option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> option(RECORDKEY_FIELD_OPT_KEY, "uuid").
> option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> option(TABLE_NAME, tableName).
> mode(Overwrite).
> save(basePath) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-1869) Upgrading Spark3 To 3.1

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-1869.


> Upgrading Spark3 To 3.1
> ---
>
> Key: HUDI-1869
> URL: https://issues.apache.org/jira/browse/HUDI-1869
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Spark 3.1 has changed the behavior of some internal classes and interfaces in
> both the spark-sql and spark-core modules.
> Currently hudi can't compile successfully under Spark 3.1. We need to add sql
> support for Spark 3.1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-1832) Support Hoodie CLI Command In Spark SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-1832.

Resolution: Duplicate

These will be supported by the CALL procedure command.

> Support Hoodie CLI Command In Spark SQL
> ---
>
> Key: HUDI-1832
> URL: https://issues.apache.org/jira/browse/HUDI-1832
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: spark
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Major
>
> Move the Hoodie CLI commands to spark sql. The syntax looks like the following:
> {code:java}
> CLI_COMMAND [ (param_key1 = value1, param_key2 = value2...) ]
> {code}
> e.g.
> {code:java}
> commits show
> commit showfiles (commit = ‘20210114221306’, limit = 10)
> show rollbacks
> savepoint create (commit = ‘20210114221306’)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-2482) Support drop partitions SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2482.


> Support drop partitions SQL
> ---
>
> Key: HUDI-2482
> URL: https://issues.apache.org/jira/browse/HUDI-2482
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-2456) Support show partitions SQL

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2456.


> Support show partitions SQL
> ---
>
> Key: HUDI-2456
> URL: https://issues.apache.org/jira/browse/HUDI-2456
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.10.0
>
>
> Spark SQL supports the following syntax to show a hudi table's partitions.
> {code:java}
> SHOW PARTITIONS tableIdentifier partitionSpec?{code}
>  
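> For example (a hypothetical table and partition value, using the standard
> partitionSpec syntax):
> {code:sql}
> SHOW PARTITIONS hudi_tbl;
> SHOW PARTITIONS hudi_tbl PARTITION (dt = '2021-10-01');
> {code}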



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-2538) Persist configs to hoodie.properties on the first write

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-2538.


> Persist configs to hoodie.properties on the first write
> ---
>
> Key: HUDI-2538
> URL: https://issues.apache.org/jira/browse/HUDI-2538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Some configs, like `keygenerator.class`, `hive_style_partitioning`, and
> `partitionpath.urlencode`, should be persisted to hoodie.properties when data
> is written for the first time; otherwise inconsistent behavior can occur.
> Subsequent write operations then don't need to provide these configs, and if
> the provided configs don't match the existing ones, exceptions should be
> raised. This is also useful for solving some of the keyGenerator discrepancy
> issues between the DataFrame writer and SQL.
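> A minimal sketch of the intended flow (the option keys are existing Hudi write
> options; `df`, `df2` and `basePath` are placeholders):
> {code:scala}
> // first write: the key-affecting configs get persisted to hoodie.properties
> df.write.format("hudi").
>   option("hoodie.table.name", "tbl").
>   option("hoodie.datasource.write.keygenerator.class",
>     "org.apache.hudi.keygen.SimpleKeyGenerator").
>   option("hoodie.datasource.write.hive_style_partitioning", "true").
>   option("hoodie.datasource.write.partitionpath.urlencode", "false").
>   mode("overwrite").save(basePath)
>
> // later writes can omit them; conflicting values should raise an exception
> df2.write.format("hudi").
>   option("hoodie.table.name", "tbl").
>   mode("append").save(basePath)
> {code}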



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209
 ] 

Yann Byron edited comment on HUDI-3201 at 2/24/22, 6:49 AM:


h4. `hoodie.datasource.write.partitionpath.urlencode` can affect this behavior.


was (Author: biyan900...@gmail.com):
h4. `hoodie.datasource.write.partitionpath.urlencode can affect this behavior.

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497209#comment-17497209
 ] 

Yann Byron commented on HUDI-3201:
--

h4. `hoodie.datasource.write.partitionpath.urlencode can affect this behavior.

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3201) Make partition auto discovery configurable

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3201.

Resolution: Fixed

> Make partition auto discovery configurable
> --
>
> Key: HUDI-3201
> URL: https://issues.apache.org/jira/browse/HUDI-3201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3202) Add keygen to support partition discovery

2022-02-23 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3202.

Resolution: Won't Do

It's not necessary to add another keygen for this; the behavior can be controlled 
by `hoodie.datasource.write.partitionpath.urlencode`. Enabling it by default 
would work.
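
For example, a sketch of the behavior being relied on (`basePath` and the data are placeholders):
{code:java}
val basePath = "file:///tmp/hudi/urlencode_demo"
val df = Seq((1, "2021/10/01", 0L)).toDF("id", "dt", "ts")
df.write.format("hudi").
  option("hoodie.table.name", "urlencode_demo").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.write.partitionpath.urlencode", "true").
  mode("append").save(basePath)
// the "2021/10/01" value lands as a single encoded segment .../2021%2F10%2F01/
// instead of the nested path .../2021/10/01/, so it can round-trip on read
{code}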

> Add keygen to support partition discovery
> -
>
> Key: HUDI-3202
> URL: https://issues.apache.org/jira/browse/HUDI-3202
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3423) Upgrade Spark to 3.2.1

2022-02-22 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-3423.

Resolution: Fixed

> Upgrade Spark to 3.2.1
> --
>
> Key: HUDI-3423
> URL: https://issues.apache.org/jira/browse/HUDI-3423
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3423) Upgrade Spark to 3.2.1

2022-02-22 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3423:


Assignee: Yann Byron

> Upgrade Spark to 3.2.1
> --
>
> Key: HUDI-3423
> URL: https://issues.apache.org/jira/browse/HUDI-3423
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3478) support Changing Data Capture for Hudi

2022-02-22 Thread Yann Byron (Jira)
Yann Byron created HUDI-3478:


 Summary: support Changing Data Capture for Hudi
 Key: HUDI-3478
 URL: https://issues.apache.org/jira/browse/HUDI-3478
 Project: Apache Hudi
  Issue Type: New Feature
  Components: spark, writer-core
Reporter: Yann Byron
 Fix For: 0.12.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3478) support Changing Data Capture for Hudi

2022-02-22 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3478:


Assignee: Yann Byron

> support Changing Data Capture for Hudi
> --
>
> Key: HUDI-3478
> URL: https://issues.apache.org/jira/browse/HUDI-3478
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3423) Upgrade Spark to 3.2.1

2022-02-14 Thread Yann Byron (Jira)
Yann Byron created HUDI-3423:


 Summary: Upgrade Spark to 3.2.1
 Key: HUDI-3423
 URL: https://issues.apache.org/jira/browse/HUDI-3423
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3200) File Index config affects partition fields shown in printSchema results

2022-02-11 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3200:


Assignee: Yann Byron

> File Index config affects partition fields shown in printSchema results
> ---
>
> Key: HUDI-3200
> URL: https://issues.apache.org/jira/browse/HUDI-3200
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Discovered in HUDI-3065: disabling the file index config should not affect 
> the partition fields shown in printSchema. 
> It looks like, since 0.9.0:
> - file index = true: it enables partition auto discovery
> - file index = false: it disables partition auto discovery
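> A sketch of the toggle involved (`basePath` is a placeholder; the config key is the 0.9/0.10-era file index switch and is stated from memory, so treat it as an assumption):
> {code:java}
> // file index enabled (default): partition columns are discovered and shown
> spark.read.format("hudi").load(basePath).printSchema()
> // file index disabled: partition fields should still be shown; currently they are not, which is this bug
> spark.read.format("hudi").option("hoodie.file.index.enable", "false").load(basePath).printSchema()
> {code}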



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3402) Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default

2022-02-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3402:
-
Status: In Progress  (was: Open)

> Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default
> -
>
> Key: HUDI-3402
> URL: https://issues.apache.org/jira/browse/HUDI-3402
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Hoodie converts {{Timestamp}} to the TIMESTAMP_MICROS format on upsert and 
> other operations, except {{bulk_insert}}.
> {{bulk_insert}} enables {{hoodie.datasource.write.row.writer.enable}} and 
> uses {{HoodieRowParquetWriteSupport}} to write data.
> As issue [#4552|https://github.com/apache/hudi/issues/4552] shows, that 
> causes problems by default. So I suggest changing the 
> {{hoodie.parquet.outputtimestamptype}} default value to TIMESTAMP_MICROS, 
> which will be more convenient for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3402) Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default

2022-02-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3402:
-
Status: Patch Available  (was: In Progress)

> Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default
> -
>
> Key: HUDI-3402
> URL: https://issues.apache.org/jira/browse/HUDI-3402
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Hoodie converts {{Timestamp}} to the TIMESTAMP_MICROS format on upsert and 
> other operations, except {{bulk_insert}}.
> {{bulk_insert}} enables {{hoodie.datasource.write.row.writer.enable}} and 
> uses {{HoodieRowParquetWriteSupport}} to write data.
> As issue [#4552|https://github.com/apache/hudi/issues/4552] shows, that 
> causes problems by default. So I suggest changing the 
> {{hoodie.parquet.outputtimestamptype}} default value to TIMESTAMP_MICROS, 
> which will be more convenient for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3402) Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default

2022-02-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3402:


Assignee: Yann Byron

> Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default
> -
>
> Key: HUDI-3402
> URL: https://issues.apache.org/jira/browse/HUDI-3402
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Hoodie converts {{Timestamp}} to the TIMESTAMP_MICROS format on upsert and 
> other operations, except {{bulk_insert}}.
> {{bulk_insert}} enables {{hoodie.datasource.write.row.writer.enable}} and 
> uses {{HoodieRowParquetWriteSupport}} to write data.
> As issue [#4552|https://github.com/apache/hudi/issues/4552] shows, that 
> causes problems by default. So I suggest changing the 
> {{hoodie.parquet.outputtimestamptype}} default value to TIMESTAMP_MICROS, 
> which will be more convenient for users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3333) getNestedFieldVal breaks with Spark 3.2

2022-02-10 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3333:
-
Status: Patch Available  (was: In Progress)

> getNestedFieldVal breaks with Spark 3.2
> ---
>
> Key: HUDI-3333
> URL: https://issues.apache.org/jira/browse/HUDI-3333
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When `returnNullIfNotFound` is set to true, the method still throws an exception. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3200) File Index config affects partition fields shown in printSchema results

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3200:
-
Status: In Progress  (was: Open)

> File Index config affects partition fields shown in printSchema results
> ---
>
> Key: HUDI-3200
> URL: https://issues.apache.org/jira/browse/HUDI-3200
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Reporter: Raymond Xu
>Priority: Critical
> Fix For: 0.11.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Discovered in HUDI-3065: disabling the file index config should not affect 
> the partition fields shown in printSchema. 
> It looks like, since 0.9.0:
> - file index = true: it enables partition auto discovery
> - file index = false: it disables partition auto discovery



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3338:
-
Status: Patch Available  (was: In Progress)

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> For HUDI-3204, COW tables and MOR tables in read_optimized query mode should 
> return the original 'yyyy-MM-dd' format of `data_date`, not 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
> COW and the read_optimized query mode of MOR.
> Spark's HadoopFsRelation appends the partition value parsed from the actual 
> partition path. However, unlike a normal table, Hudi persists the partition 
> value in the parquet file, so we just need to read the partition value from 
> the parquet file rather than leave it to Spark.
> So we should not use `HadoopFsRelation` any more, and should implement Hudi's 
> own `Relation` to handle this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2610) Fix Spark version info for hudi table CTAS from another hudi table

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-2610:
-
Status: In Progress  (was: Open)

> Fix Spark version info for hudi table CTAS from another hudi table
> --
>
> Key: HUDI-2610
> URL: https://issues.apache.org/jira/browse/HUDI-2610
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> See details in the original issue
>  
> https://github.com/apache/hudi/issues/3662#issuecomment-938489457



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3338:
-
Status: In Progress  (was: Open)

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> For HUDI-3204, COW tables and MOR tables in read_optimized query mode should 
> return the original 'yyyy-MM-dd' format of `data_date`, not 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
> COW and the read_optimized query mode of MOR.
> Spark's HadoopFsRelation appends the partition value parsed from the actual 
> partition path. However, unlike a normal table, Hudi persists the partition 
> value in the parquet file, so we just need to read the partition value from 
> the parquet file rather than leave it to Spark.
> So we should not use `HadoopFsRelation` any more, and should implement Hudi's 
> own `Relation` to handle this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3204) spark on TimestampBasedKeyGenerator has no result when query by partition column

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3204:
-
Status: Patch Available  (was: In Progress)

> spark on TimestampBasedKeyGenerator has no result when query by partition 
> column
> 
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: hudi-on-call, pull-request-available, sev:critical
> Fix For: 0.11.0
>
>   Original Estimate: 3h
>  Time Spent: 1h
>  Remaining Estimate: 1h
>
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
> // mor
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
> option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_mor")
>
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
> option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_cow")
>
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
>
> // can not query any data
> spark.read.format("hudi")

[jira] [Assigned] (HUDI-3333) getNestedFieldVal breaks with Spark 3.2

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3333:


Assignee: Yann Byron

> getNestedFieldVal breaks with Spark 3.2
> ---
>
> Key: HUDI-3333
> URL: https://issues.apache.org/jira/browse/HUDI-3333
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When `returnNullIfNotFound` is set to true, the method still throws an exception. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3333) getNestedFieldVal breaks with Spark 3.2

2022-02-09 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3333:
-
Status: In Progress  (was: Open)

> getNestedFieldVal breaks with Spark 3.2
> ---
>
> Key: HUDI-3333
> URL: https://issues.apache.org/jira/browse/HUDI-3333
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When `returnNullIfNotFound` is set to true, the method still throws an exception. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3403) Manage immutable hudi Configurations

2022-02-09 Thread Yann Byron (Jira)
Yann Byron created HUDI-3403:


 Summary: Manage immutable hudi Configurations
 Key: HUDI-3403
 URL: https://issues.apache.org/jira/browse/HUDI-3403
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron
 Fix For: 0.12.0


https://github.com/apache/hudi/pull/4714#discussion_r798474157



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3402) Set hoodie.parquet.outputtimestamptype to TIMESTAMP_MICROS by default

2022-02-09 Thread Yann Byron (Jira)
Yann Byron created HUDI-3402:


 Summary: Set hoodie.parquet.outputtimestamptype to 
TIMESTAMP_MICROS by default
 Key: HUDI-3402
 URL: https://issues.apache.org/jira/browse/HUDI-3402
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, writer-core
Reporter: Yann Byron
 Fix For: 0.11.0


Hoodie converts {{Timestamp}} to the TIMESTAMP_MICROS format on upsert and other 
operations, except {{bulk_insert}}.

{{bulk_insert}} enables {{hoodie.datasource.write.row.writer.enable}} and uses 
{{HoodieRowParquetWriteSupport}} to write data.

As issue [#4552|https://github.com/apache/hudi/issues/4552] shows, that causes 
problems by default. So I suggest changing the 
{{hoodie.parquet.outputtimestamptype}} default value to TIMESTAMP_MICROS, which 
will be more convenient for users.
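
A sketch of the suggestion in effect (`basePath` and the data are placeholders; today the option must be set explicitly, the proposal is to make it the default):
{code:java}
val basePath = "file:///tmp/hudi/ts_demo"
val df = Seq((1, java.sql.Timestamp.valueOf("2022-01-01 00:00:00"), "p1", 0L)).
  toDF("id", "event_ts", "dt", "ts")
df.write.format("hudi").
  option("hoodie.table.name", "ts_demo").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.parquet.outputtimestamptype", "TIMESTAMP_MICROS").
  mode("append").save(basePath)
// event_ts is then written as TIMESTAMP_MICROS, matching what upsert produces
{code}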



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-2972) Support different Spark internal Timestamp and Date types

2022-02-05 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487436#comment-17487436
 ] 

Yann Byron edited comment on HUDI-2972 at 2/5/22, 9:01 AM:
---

[~ryanpife] can you retry by hudi master branch which includes this 
[HUDI-3125|https://github.com/apache/hudi/pull/4471]


was (Author: biyan900...@gmail.com):
[~ryanpife] can you retry by hudi master branch which include this 
[HUDI-3125|https://github.com/apache/hudi/pull/4471]

> Support different Spark internal Timestamp and Date types
> -
>
> Key: HUDI-2972
> URL: https://issues.apache.org/jira/browse/HUDI-2972
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Ryan Pifer
>Priority: Critical
>
> In Spark 3, a configuration was added, {{spark.sql.datetime.java8API.enabled}}, 
> which changes the internal Row type for Timestamp and Date values to 
> *Instant* or *LocalDate*. 
> https://issues.apache.org/jira/browse/SPARK-27008
> In Spark 3.1 this is enabled by default through spark-sql, which breaks writes 
> using Timestamps. It's also likely this could become the default across all of 
> Spark in the future, at which point it would be a breaking issue.
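> For example, a spark-shell sketch of the behavior described above:
> {code:java}
> spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
> // timestamp values now surface as java.time.Instant instead of java.sql.Timestamp
> spark.sql("select timestamp'2021-05-07 00:00:00' as ts").collect()(0).get(0).getClass
> // -> class java.time.Instant
> {code}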
> Right now in AvroConversionHelper 
> ([ref|https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L301-L304])
>  and SqlKeyGenerator 
> ([ref|https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala])
>  they cannot handle this properly.
> When partitioned by Timestamp
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Invalid format: 
> "2021-05-07T00:00:00Z" is malformed at "T00:00:00Z" at 
> org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
>  at 
> org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
>  at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$convertPartitionPathToSqlType$1(SqlKeyGenerator.scala:94)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.convertPartitionPathToSqlType(SqlKeyGenerator.scala:85)
>  at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.getPartitionPath(SqlKeyGenerator.scala:115)
>  at 
> org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777){code}
> Inserts with type Timestamp
> {code:java}
> 21/10/21 18:14:17 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) 
> (ip-10-71-235-164.ec2.internal executor 20): java.lang.ClassCastException: 
> java.time.Instant cannot be cast to java.sql.Timestamp at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:304)
>  at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:304)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:304)
>  at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
>  at 
> org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2972) Support different Spark internal Timestamp and Date types

2022-02-05 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487436#comment-17487436
 ] 

Yann Byron commented on HUDI-2972:
--

[~ryanpife] can you retry by hudi master branch which include this 
[HUDI-3125|https://github.com/apache/hudi/pull/4471]

> Support different Spark internal Timestamp and Date types
> -
>
> Key: HUDI-2972
> URL: https://issues.apache.org/jira/browse/HUDI-2972
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Ryan Pifer
>Priority: Critical
>
> In Spark 3, a configuration was added, {{spark.sql.datetime.java8API.enabled}}, 
> which changes the internal Row type for Timestamp and Date values to 
> *Instant* or *LocalDate*. 
> https://issues.apache.org/jira/browse/SPARK-27008
> In Spark 3.1 this is enabled by default through spark-sql, which breaks writes 
> using Timestamps. It's also likely this could become the default across all of 
> Spark in the future, at which point it would be a breaking issue.
> Right now in AvroConversionHelper 
> ([ref|https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L301-L304])
>  and SqlKeyGenerator 
> ([ref|https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/SqlKeyGenerator.scala])
>  they cannot handle this properly.
> When partitioned by Timestamp
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Invalid format: 
> "2021-05-07T00:00:00Z" is malformed at "T00:00:00Z" at 
> org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
>  at 
> org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826)
>  at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$convertPartitionPathToSqlType$1(SqlKeyGenerator.scala:94)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.convertPartitionPathToSqlType(SqlKeyGenerator.scala:85)
>  at 
> org.apache.spark.sql.hudi.command.SqlKeyGenerator.getPartitionPath(SqlKeyGenerator.scala:115)
>  at 
> org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:777){code}
> Inserts with type Timestamp
> {code:java}
> 21/10/21 18:14:17 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) 
> (ip-10-71-235-164.ec2.internal executor 20): java.lang.ClassCastException: 
> java.time.Instant cannot be cast to java.sql.Timestamp at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:304)
>  at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:304)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:304)
>  at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
>  at 
> org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-3314) support merge into with no-pk condition

2022-02-03 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486800#comment-17486800
 ] 

Yann Byron edited comment on HUDI-3314 at 2/4/22, 3:33 AM:
---

[~shivnarayan] This is a little complicated: implementing merge on a non-pk field 
with all cases considered takes real effort. I'm worried there won't be enough 
time to support this, so I've set the fix version to 0.12.0 directly. I'll do my 
best to finish ahead of time.

If anyone has a chance to take this, go ahead; I'd be glad to help review.
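
For reference, a sketch of the kind of statement this covers (the merge condition is on a non-primary-key field; table and column names assumed):
{code:java}
spark.sql("""
  merge into h0 t
  using src s
  on t.name = s.name   -- non-pk condition; today the condition must involve the primary key
  when matched then update set *
  when not matched then insert *
""")
{code}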


was (Author: biyan900...@gmail.com):
[~shivnarayan] This is a little complicated: implementing merge on a non-pk field 
with all cases considered takes real effort. I'm worried there won't be enough 
time to support this, so I've set the fix version to 0.12.0 directly.

If anyone has a chance to take this, go ahead; I'd be glad to help review.

> support merge into with no-pk condition
> ---
>
> Key: HUDI-3314
> URL: https://issues.apache.org/jira/browse/HUDI-3314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3314) support merge into with no-pk condition

2022-02-03 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486800#comment-17486800
 ] 

Yann Byron commented on HUDI-3314:
--

[~shivnarayan] This is a little complicated: implementing merge on a non-pk field 
with all cases considered takes real effort. I'm worried there won't be enough 
time to support this, so I've set the fix version to 0.12.0 directly.

If anyone has a chance to take this, go ahead; I'd be glad to help review.

> support merge into with no-pk condition
> ---
>
> Key: HUDI-3314
> URL: https://issues.apache.org/jira/browse/HUDI-3314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3338:
-
Description: 
For HUDI-3204, COW tables and MOR tables in read_optimized query mode should 
return the original 'yyyy-MM-dd' format of `data_date`, not 'yyyy/MM/dd'.

The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
COW and the read_optimized query mode of MOR.

Spark's HadoopFsRelation appends the partition value parsed from the actual 
partition path. However, unlike a normal table, Hudi persists the partition 
value in the parquet file, so we just need to read the partition value from 
the parquet file rather than leave it to Spark.

So we should not use `HadoopFsRelation` any more, and should implement Hudi's 
own `Relation` to handle this.
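
A quick way to see the symptom (a sketch reusing the HUDI-3204 repro table):
{code:java}
// via HadoopFsRelation the partition column is parsed back from the path and
// shows up as 'yyyy/MM/dd'; with a Hudi-own Relation it should come from the
// value persisted in the parquet file, i.e. 'yyyy-MM-dd'
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").select("data_date").show()
{code}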

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Priority: Major
>
> For HUDI-3204, COW tables and MOR tables in read_optimized query mode should 
> return the original 'yyyy-MM-dd' format of `data_date`, not 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
> COW and the read_optimized query mode of MOR.
> Spark's HadoopFsRelation appends the partition value parsed from the actual 
> partition path. However, unlike a normal table, Hudi persists the partition 
> value in the parquet file, so we just need to read the partition value from 
> the parquet file rather than leave it to Spark.
> So we should not use `HadoopFsRelation` any more, and should implement Hudi's 
> own `Relation` to handle this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-01-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-3338:


 Summary: Use custom relation instead of HadoopFsRelation
 Key: HUDI-3338
 URL: https://issues.apache.org/jira/browse/HUDI-3338
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, spark-sql
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3204) spark on TimestampBasedKeyGenerator has no result when query by partition column

2022-01-26 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482858#comment-17482858
 ] 

Yann Byron commented on HUDI-3204:
--

[~shivnarayan] [~taisenki] 

I also agree that the incremental query's result is the better one. Accordingly, 
we have a to-do list (a sketch of the expected behavior follows the list):
 # for COW, the `data_date` field should return the original 'yyyy-MM-dd' 
format. It should be fixed.
 # for MOR, in snapshot and read_optimized queries, Hudi should respond 
correctly to 'data_date' predicates in the 'yyyy-MM-dd' format.
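
{code:java}
// should return the row, filtering by the original 'yyyy-MM-dd' value
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").
  where("data_date = '2018-09-24'").show()
{code}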

> spark on TimestampBasedKeyGenerator has no result when query by partition 
> column
> 
>
> Key: HUDI-3204
> URL: https://issues.apache.org/jira/browse/HUDI-3204
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: hudi-on-call, sev:critical
> Fix For: 0.11.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
>  
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
> // mor
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
> option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_mor")
>
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
> option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
> option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
> option("hoodie.datasource.write.recordkey.field", "id").
> option("hoodie.datasource.write.partitionpath.field", "data_date").
> option("hoodie.datasource.write.precombine.field", "ts").
> option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
> option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
> option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
> mode(org.apache.spark.sql.SaveMode.Append).
> save("file:///tmp/hudi/issue_4417_cow")
>
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> 

[jira] [Comment Edited] (HUDI-3232) support reload timeline Incrementally

2022-01-24 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481615#comment-17481615
 ] 

Yann Byron edited comment on HUDI-3232 at 1/25/22, 7:54 AM:


[~shivnarayan]

I plan to do something in HoodieActiveTimeline.

All the methods in HoodieActiveTimeline will first check whether the latest 
instants need to be loaded. If so, they load them incrementally and then run 
the rest of their logic.

Also, every method that creates a new instant will force a reload after the 
operation.

WDYT?
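
A self-contained toy model of the plan (all names hypothetical; the real class is HoodieActiveTimeline, this only shows the shape):
{code:java}
class ToyTimeline(storage: () => List[String]) {              // storage lists instant times, ascending
  private var instants: List[String] = storage()
  private var lastSeen: Option[String] = instants.lastOption
  private def ensureLatest(): Unit = {
    val fresh = storage().filter(i => lastSeen.forall(_ < i)) // only instants newer than lastSeen
    if (fresh.nonEmpty) {                                     // incremental load instead of full reload
      instants = instants ++ fresh
      lastSeen = instants.lastOption
    }
  }
  def getInstants: List[String] = { ensureLatest(); instants } // every read checks first
}
{code}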


was (Author: biyan900...@gmail.com):
[~shivnarayan]

Yep, it's very much like your comment above.

> support reload timeline Incrementally
> -
>
> Key: HUDI-3232
> URL: https://issues.apache.org/jira/browse/HUDI-3232
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, incremental-query, writer-core
>Reporter: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Recently, `HoodieTableMetaClient.reloadActiveTimeline` has been called many 
> times within one operation, and each call reloads the timeline fully.
> Supporting an incremental reload mode should improve performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3232) support reload timeline Incrementally

2022-01-24 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481615#comment-17481615
 ] 

Yann Byron commented on HUDI-3232:
--

[~shivnarayan]

Yep, it's very much like your comment above.

> support reload timeline Incrementally
> -
>
> Key: HUDI-3232
> URL: https://issues.apache.org/jira/browse/HUDI-3232
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, incremental-query, writer-core
>Reporter: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Recently, `HoodieTableMetaClient.reloadActiveTimeline` has been called many 
> times within one operation, and each call reloads the timeline fully.
> Supporting an incremental reload mode should improve performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3314) support merge into with no-pk condition

2022-01-24 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3314:
-
Fix Version/s: 0.12.0
   (was: 0.11.0)
Reviewers: Raymond Xu, Vinoth Chandar

> support merge into with no-pk condition
> ---
>
> Key: HUDI-3314
> URL: https://issues.apache.org/jira/browse/HUDI-3314
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Yann Byron
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3314) support merge into with no-pk condition

2022-01-24 Thread Yann Byron (Jira)
Yann Byron created HUDI-3314:


 Summary: support merge into with no-pk condition
 Key: HUDI-3314
 URL: https://issues.apache.org/jira/browse/HUDI-3314
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark-sql
Reporter: Yann Byron
 Fix For: 0.11.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2968) Support Delete/Update using non-pk fields

2022-01-24 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480988#comment-17480988
 ] 

Yann Byron commented on HUDI-2968:
--

[~vinoth] 

If the source has fields matching the target's `primaryKey` and `preCombineField`, 
it works.

I also need to polish `merge into` in SQL, clearly show which cases work, and 
add UTs for this. Thanks for the reminder.
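
For example, a sketch of the shape that works (assuming the target `h0` was created with primaryKey = 'id' and preCombineField = 'ts', and the source exposes both columns):
{code:java}
spark.sql("""
  merge into h0 t
  using (select 1 as id, 'a1' as name, 10.0 as price, 1000 as ts) s
  on t.id = s.id
  when matched then update set *
  when not matched then insert *
""")
{code}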

> Support Delete/Update using non-pk fields
> -
>
> Key: HUDI-2968
> URL: https://issues.apache.org/jira/browse/HUDI-2968
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Allow to delete/update using non-pk fields
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi 
> options (primaryKey = 'id');
> update h0 set price = 10 where name = 'foo'; 
> delete from h0 where name = 'foo';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3237) ALTER TABLE column type change fails select query

2022-01-24 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3237:
-
Status: In Progress  (was: Open)

> ALTER TABLE column type change fails select query
> -
>
> Key: HUDI-3237
> URL: https://issues.apache.org/jira/browse/HUDI-3237
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.10.1
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-01-13-17-04-09-038.png
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id int
> ALTER TABLE cow_nonpt_nonpcf_tbl change column id id bigint;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id bigint
> -- this works fine so far
> select * from cow_nonpt_nonpcf_tbl;
> -- throws exception
> {code}
> {code}
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> file:///opt/spark-warehouse/cow_nonpt_nonpcf_tbl/ff3c68e6-84d4-4a8a-8bc8-cc58736847aa-0_0-7-7_20220112182401452.parquet.
>  Column: [id], Expected: bigint, Found: INT32
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> ... 20 more
> {code}
> reported while testing on 0.10.1-rc1 (spark 3.0.3, 3.1.2)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3253) preferred to use table's location

2022-01-18 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3253:
-
Status: In Progress  (was: Open)

> preferred to use table's location
> -
>
> Key: HUDI-3253
> URL: https://issues.apache.org/jira/browse/HUDI-3253
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When we create a Hudi table with a specified location that isn't a subpath 
> of the current database's location, and then turn this table into a managed 
> table, Hudi fails to find the right table path.
> The steps you can run to reproduce:
>  
> {code:java}
> // create table in SPARK
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> )
> location '/user/hudi/cow_nonpt_nonpcf_tbl';
> // turn it to a managed table in HIVE 
> alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');
> // insert some data in SPARK
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> // will throw FileNotFoundException{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3253) preferred to use table's location

2022-01-15 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3253:
-
Description: 
When we create a Hudi table with a specified location that isn't a subpath of 
the current database's location, and then turn this table into a managed table, 
Hudi fails to find the right table path.

The steps you can run to reproduce:

 
{code:java}
// create table in SPARK

create table if not exists cow_nonpt_nonpcf_tbl (
  id int,
  name string,
  price double
) using hudi
options (
  type = 'cow',
  primaryKey = 'id'
)
location '/user/hudi/cow_nonpt_nonpcf_tbl';

// turn it to a managed table in HIVE 
alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');

// insert some data in SPARK
insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;

// will throw FileNotFoundException{code}
 

 

 

  was:
When we create a Hudi table with a specified location that isn't a subpath of 
the current database's location, and then turn this table into a managed table, 
Hudi fails to find the right table path.

The steps you can run to reproduce:

 
{code:java}
// create table in SPARK

create table if not exists cow_nonpt_nonpcf_tbl (
  id int,
  name string,
  price double
) using hudi
options (
  type = 'cow',
  primaryKey = 'id'
);

// turn it to a managed table in HIVE 
alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');

// insert some data in SPARK
insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;

// will throw FileNotFoundException{code}
 

 

 


> preferred to use table's location
> -
>
> Key: HUDI-3253
> URL: https://issues.apache.org/jira/browse/HUDI-3253
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark SQL
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> When we create a Hudi table with a specified location that isn't a subpath 
> of the current database's location, and then turn this table into a managed 
> table, Hudi fails to find the right table path.
> The steps you can run to reproduce:
>  
> {code:java}
> // create table in SPARK
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> )
> location '/user/hudi/cow_nonpt_nonpcf_tbl';
> // turn it to a managed table in HIVE 
> alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');
> // insert some data in SPARK
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> // will throw FileNotFoundException{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3253) preferred to use table's location

2022-01-15 Thread Yann Byron (Jira)
Yann Byron created HUDI-3253:


 Summary: preferred to use table's location
 Key: HUDI-3253
 URL: https://issues.apache.org/jira/browse/HUDI-3253
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark SQL
Reporter: Yann Byron
 Fix For: 0.11.0


When we create a Hudi table with a specified location that isn't a subpath of 
the current database's location, and then turn this table into a managed table, 
Hudi fails to find the right table path.

The steps you can run to reproduce:

 
{code:java}
// create table in SPARK

create table if not exists cow_nonpt_nonpcf_tbl (
  id int,
  name string,
  price double
) using hudi
options (
  type = 'cow',
  primaryKey = 'id'
);

// turn it to a managed table in HIVE 
alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');

// insert some data in SPARK
insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;

// will throw FileNotFoundException{code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3253) preferred to use table's location

2022-01-15 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-3253:


Assignee: Yann Byron

> preferred to use table's location
> -
>
> Key: HUDI-3253
> URL: https://issues.apache.org/jira/browse/HUDI-3253
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark SQL
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.11.0
>
>
> When we create a Hudi table with a specified location that isn't a subpath 
> of the current database's location, and then turn this table into a managed 
> table, Hudi fails to find the right table path.
> The steps you can run to reproduce:
>  
> {code:java}
> // create table in SPARK
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> // turn it to a managed table in HIVE 
> alter table cow_nonpt_nonpcf_tbl set tblproperties ('EXTERNAL'='false');
> // insert some data in SPARK
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> // will throw FileNotFoundException{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3240) ALTER TABLE rename breaks with managed table in Spark 2.4

2022-01-15 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476630#comment-17476630
 ] 

Yann Byron commented on HUDI-3240:
--

[~xushiyan]

If this is a managed table, the `rename` operation renames the table path at the 
same time. The new table path is `\{the location of current 
database}/newTableName`.

 

Also, I need more detail about your testing environment. I can't reproduce this 
with Spark 2.4.4 on the Hudi master branch or the release-0.10.1-rc1 branch; 
renaming the table and renaming the table path both work correctly there.
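
For reference, a sketch of what I ran (reusing the repro's table name):
{code:java}
spark.sql("ALTER TABLE cow_nonpt_nonpcf_tbl RENAME TO cow_nonpt_nonpcf_tbl_2")
// for a managed table the data path is renamed as well, e.g.
// /user/hive/warehouse/cow_nonpt_nonpcf_tbl -> /user/hive/warehouse/cow_nonpt_nonpcf_tbl_2
spark.sql("select * from cow_nonpt_nonpcf_tbl_2").show()  // works in my environment
{code}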

> ALTER TABLE rename breaks with managed table in Spark 2.4
> -
>
> Key: HUDI-3240
> URL: https://issues.apache.org/jira/browse/HUDI-3240
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.11.0
>
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> ALTER TABLE cow_nonpt_nonpcf_tbl RENAME TO cow_nonpt_nonpcf_tbl_2;
> desc cow_nonpt_nonpcf_tbl_2;
> -- desc works fine 
> select * from cow_nonpt_nonpcf_tbl_2;
> -- throws exception{code}
> {code:java}
> 22/01/13 03:48:18 ERROR SparkSQLDriver: Failed in [select * from 
> cow_nonpt_nonpcf_tbl_2]
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File 
> file:/user/hive/warehouse/cow_nonpt_nonpcf_tbl_2 does not exist
>         at 
> org.spark_project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>         at 
> org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>         at 
> org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>         at 
> org.spark_project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>         at 
> org.spark_project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>         at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>         at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>         at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>         at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>         at 
> org.spark_project.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
>         at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:141)
>         at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:227)
>         at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:264)
>         at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:255)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$apply$6.apply(AnalysisHelper.scala:113)
>         at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$apply$6.apply(AnalysisHelper.scala:113)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChi

[jira] [Comment Edited] (HUDI-3237) ALTER TABLE column type change fails select query

2022-01-13 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475182#comment-17475182
 ] 

Yann Byron edited comment on HUDI-3237 at 1/13/22, 10:48 AM:
-

If you query right after finishing `alter table change column`, it'll fail. 
But if you execute `alter table change column`, then `insert`, and then query, 
it can work, because the data is then read as `INT64` instead of `INT32`.
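
A condensed sketch of the two sequences (hedged: it reuses the table from the 
issue quoted below, and the exact error text can vary by Spark version):

{code:sql}
-- Sequence 1: ALTER then query fails, because the existing Parquet files
-- still store id with physical type INT32 while Spark now expects bigint.
ALTER TABLE cow_nonpt_nonpcf_tbl CHANGE COLUMN id id bigint;
SELECT * FROM cow_nonpt_nonpcf_tbl;  -- Parquet column cannot be converted

-- Sequence 2: ALTER, then a new write, then query can work, because the
-- files rewritten by the new write store id as INT64.
ALTER TABLE cow_nonpt_nonpcf_tbl CHANGE COLUMN id id bigint;
INSERT INTO cow_nonpt_nonpcf_tbl SELECT 2, 'a2', 30;
SELECT * FROM cow_nonpt_nonpcf_tbl;
{code}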

Since there is no open API to control the schema conversion, whether 
`spark.sql.parquet.enableVectorizedReader` is enabled or not, it's hard to 
support this query.

As shown below, it's hard to make the predicate true for now.

!image-2022-01-13-17-04-09-038.png|width=686,height=371!

 

So, should we disable the ability to change the data type, given that this 
case can't work?

Also, I find that changing the data type is not supported for Parquet tables 
or Delta Lake tables.

 


> ALTER TABLE column type change fails select query
> -
>
> Key: HUDI-3237
> URL: https://issues.apache.org/jira/browse/HUDI-3237
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.11.0
>
> Attachments: image-2022-01-13-17-04-09-038.png
>
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id int
> ALTER TABLE cow_nonpt_nonpcf_tbl change column id id bigint;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id bigint
> -- this works fine so far
> select * from cow_nonpt_nonpcf_tbl;
> -- throws exception
> {code}
> {code}
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> file:///opt/spark-warehouse/cow_nonpt_nonpcf_tbl/ff3c68e6-84d4-4a8a-8bc8-cc58736847aa-0_0-7-7_20220112182401452.parquet.
>  Column: [id], Expected: bigint, Found: INT32
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSuppo
