[jira] [Created] (HUDI-7984) Spark SQL should allow type promotion without using schema on read

2024-07-12 Thread Shiyan Xu (Jira)
Shiyan Xu created HUDI-7984:
---

 Summary: Spark SQL should allow type promotion without using 
schema on read
 Key: HUDI-7984
 URL: https://issues.apache.org/jira/browse/HUDI-7984
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Reporter: Shiyan Xu
 Fix For: 1.0.0


{code:java}
CREATE TABLE hudi_table (
ts BIGINT,
uuid STRING,
rider INT,
driver STRING,
fare DOUBLE,
city STRING
) USING HUDI
PARTITIONED BY (city);

alter table hudi_table alter column rider type BIGINT;

ALTER TABLE CHANGE COLUMN is not supported for changing column 'rider' with 
type 'IntegerType' to 'rider' with type 'LongType' {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7960) Support more partitioners in Hudi Flink integration

2024-07-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu reassigned HUDI-7960:
---

Assignee: Zhenqiu Huang

> Support more partitioners in Hudi Flink integration 
> ---
>
> Key: HUDI-7960
> URL: https://issues.apache.org/jira/browse/HUDI-7960
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s

2024-06-24 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-7929:

Component/s: flink

> Add Flink Hudi Example for K8s
> --
>
> Key: HUDI-7929
> URL: https://issues.apache.org/jira/browse/HUDI-7929
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s

2024-06-24 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-7929:

Fix Version/s: 1.0.0

> Add Flink Hudi Example for K8s
> --
>
> Key: HUDI-7929
> URL: https://issues.apache.org/jira/browse/HUDI-7929
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7929) Add Flink Hudi Example for K8s

2024-06-24 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu reassigned HUDI-7929:
---

Assignee: Zhenqiu Huang

> Add Flink Hudi Example for K8s
> --
>
> Key: HUDI-7929
> URL: https://issues.apache.org/jira/browse/HUDI-7929
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7376) Cancel running test instances in Azure CI when the PR is updated

2024-06-15 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-7376.
---
  Assignee: Shiyan Xu  (was: Raymond Xu)
Resolution: Cannot Reproduce

> Cancel running test instances in Azure CI when the PR is updated
> 
>
> Key: HUDI-7376
> URL: https://issues.apache.org/jira/browse/HUDI-7376
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Shiyan Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4705) Support Write-on-compaction mode when query cdc on MOR tables

2024-06-07 Thread Shiyan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853214#comment-17853214
 ] 

Shiyan Xu commented on HUDI-4705:
-

[~lizhiqiang] [~biyan900...@gmail.com] to clarify, CDC for Spark already works on 
MOR tables; it is just that the implementation uses the write-on-indexing strategy 
(ref: 
[https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md#persisting-cdc-in-mor-write-on-indexing-vs-write-on-compaction])

We want to unify the implementation on write-on-compaction, which also allows the 
Flink writer to work (the write-on-indexing strategy does not work for Flink, as 
explained in the RFC).
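
For context, a minimal sketch of how the CDC read path is exercised today (assuming 
the standard {{hoodie.table.cdc.enabled}} table property; table name and instant 
times below are illustrative):

{code:sql}
-- sketch: MOR table with CDC logging enabled
CREATE TABLE hudi_cdc_tbl (
  ts BIGINT,
  uuid STRING,
  fare DOUBLE
) USING HUDI
TBLPROPERTIES (
  type = 'mor',
  'hoodie.table.cdc.enabled' = 'true'
);

-- read change records between two (illustrative) instants; for Spark on MOR
-- this is currently served by write-on-indexing, i.e. merging base and log
-- files in-flight at query time
SELECT * FROM hudi_table_changes('hudi_cdc_tbl', 'cdc', '20240101000000000', '20240102000000000');
{code}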

> Support Write-on-compaction mode when query cdc on MOR tables
> -
>
> Key: HUDI-4705
> URL: https://issues.apache.org/jira/browse/HUDI-4705
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction, spark, table-service
>Reporter: Yann Byron
>Priority: Major
>
> For CDC queries on MOR tables, the initial implementation uses the 
> `Write-on-indexing` way to extract the CDC data by merging the base file and 
> log files in-flight.
> This ticket wants to support the `Write-on-compaction` way to get the CDC 
> data just by reading the persisted CDC files, which are written during the 
> compaction operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6633) Add hms based sync to hudi website

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-6633.
---
Resolution: Fixed

> Add hms based sync to hudi website
> --
>
> Key: HUDI-6633
> URL: https://issues.apache.org/jira/browse/HUDI-6633
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Shiyan Xu
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> we should add hms based sync to our hive sync page 
> [https://hudi.apache.org/docs/syncing_metastore]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-4967:

Status: Open  (was: Patch Available)

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{hoodie.datasource.hive_sync.partition_value_extractor}}: This config 
> is used to extract and transform the partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{MultiPartKeysValueExtractor}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor}}. From 
> this release, if this config is not set and Hive sync is enabled, the 
> partition value extractor class will be *automatically inferred* based on 
> the number of partition fields and whether hive-style partitioning is 
> enabled.
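
As a doc example, a minimal sketch of pinning the legacy extractor explicitly in a 
spark-sql session (setting it as a session-level config here is an assumption; it 
can equally be supplied as a Hive sync write option):

{code:sql}
-- sketch: keep the pre-0.12.0 default extractor explicitly instead of relying on inference
set hoodie.datasource.hive_sync.partition_value_extractor = org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor;
{code}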



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4834) Update AWSGlueCatalog syncing page to add spark datasource example

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-4834.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Update AWSGlueCatalog syncing page to add spark datasource example
> --
>
> Key: HUDI-4834
> URL: https://issues.apache.org/jira/browse/HUDI-4834
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: documentation
> Fix For: 0.15.0, 1.0.0
>
>
> [https://hudi.apache.org/docs/next/syncing_aws_glue_data_catalog] this page 
> specifically talks about how to leverage this syncing mechanism via 
> Deltastreamer. We also need an example for the Spark datasource here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-4967.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> Improve docs for meta sync with TimestampBasedKeyGenerator
> --
>
> Key: HUDI-4967
> URL: https://issues.apache.org/jira/browse/HUDI-4967
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Related fix: HUDI-4966
> We need to add docs on how to properly set the meta sync configuration, 
> especially the hoodie.datasource.hive_sync.partition_value_extractor, in 
> [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, 
> the config can be different).  Check the ticket above and PR description of 
> [https://github.com/apache/hudi/pull/6851] for more details.
> We should also add the migration setup on the key generation page: 
> [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates]
>  * {{hoodie.datasource.hive_sync.partition_value_extractor}}: This config 
> is used to extract and transform the partition value during Hive sync. Its 
> default value has been changed from 
> {{SlashEncodedDayPartitionValueExtractor}} to 
> {{MultiPartKeysValueExtractor}}. If you relied on the previous default 
> value (i.e., have not set it explicitly), you are required to set the config 
> to {{org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor}}. From 
> this release, if this config is not set and Hive sync is enabled, the 
> partition value extractor class will be *automatically inferred* based on 
> the number of partition fields and whether hive-style partitioning is 
> enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6230) Make hive sync aws support partition indexes

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-6230:

Fix Version/s: 0.15.0

> Make hive sync aws support partition indexes
> 
>
> Key: HUDI-6230
> URL: https://issues.apache.org/jira/browse/HUDI-6230
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> Glue provides indexing features that speed up partition retrieval a lot. 
> So far this is not supported. Having a new hive-sync configuration to activate 
> the feature, and optionally specify which partition columns to index, would 
> be helpful.
> Also, this operation should not only happen at table creation time; it 
> should be possible to activate/deactivate it at will.
>  
> https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index
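
A rough sketch of what the requested switch might look like; the config names below 
are purely illustrative placeholders, not existing Hudi options:

{code:sql}
-- illustrative only: hypothetical configs for enabling Glue partition indexes during sync
set hoodie.datasource.meta.sync.glue.partition_index.enable = true;
set hoodie.datasource.meta.sync.glue.partition_index.fields = 'city';
{code}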



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-1964.
---
Resolution: Duplicate

> Update guide around hive metastore and hive sync for hudi tables
> 
>
> Key: HUDI-1964
> URL: https://issues.apache.org/jira/browse/HUDI-1964
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Nishith Agarwal
>Assignee: Shiyan Xu
>Priority: Minor
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-1964:

Fix Version/s: 1.0.0

> Update guide around hive metastore and hive sync for hudi tables
> 
>
> Key: HUDI-1964
> URL: https://issues.apache.org/jira/browse/HUDI-1964
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Nishith Agarwal
>Assignee: Shiyan Xu
>Priority: Minor
> Fix For: 0.15.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6633) Add hms based sync to hudi website

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-6633:

Fix Version/s: 1.0.0

> Add hms based sync to hudi website
> --
>
> Key: HUDI-6633
> URL: https://issues.apache.org/jira/browse/HUDI-6633
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Shiyan Xu
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
>
> we should add hms based sync to our hive sync page 
> [https://hudi.apache.org/docs/syncing_metastore]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-851.
--
Fix Version/s: 1.0.0
   (was: 0.15.0)
   Resolution: Duplicate

> Add Documentation on partitioning data with examples and details on how to 
> sync to Hive
> ---
>
> Key: HUDI-851
> URL: https://issues.apache.org/jira/browse/HUDI-851
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Bhavani Sudha
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: query-eng, user-support-issues
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-7414.
---
Resolution: Fixed

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: nadine
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There was a jira issue filed where sarfaraz wanted to know more about 
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set: 
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it is not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} config 
> is superfluous: it is being set, but not used anywhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs

2024-06-05 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-7414:

Fix Version/s: 1.0.0
   (was: 0.15.0)

> Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
> ---
>
> Key: HUDI-7414
> URL: https://issues.apache.org/jira/browse/HUDI-7414
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: nadine
>Assignee: Shiyan Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There was a jira issue filed where sarfaraz wanted to know more about 
> `hoodie.gcp.bigquery.sync.base_path`.
> In the BigQuerySyncConfig file, there is a config property set: 
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103]
> But it is not used anywhere else in the BigQuery code base.
> However, I see
> [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124]
> being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} config 
> is superfluous: it is being set, but not used anywhere. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7289) Fix parameters for Big Query Sync

2024-06-04 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-7289.
---
Resolution: Fixed

> Fix parameters for Big Query Sync
> -
>
> Key: HUDI-7289
> URL: https://issues.apache.org/jira/browse/HUDI-7289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Aditya Goenka
>Assignee: nadine
>Priority: Minor
> Fix For: 0.15.0
>
>
> Revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/]
>  
> From a user - 
> Info about the {{hoodie.gcp.bigquery.sync.require_partition_filter}} config, 
> which is part of Hudi 0.14.1, is missing from 
> [here|https://hudi.apache.org/docs/gcp_bigquery].
> Additionally, info about {{hoodie.gcp.bigquery.sync.base_path}} is not very 
> clear, and even the example is hard to understand.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-7289) Fix parameters for Big Query Sync

2024-06-04 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu reassigned HUDI-7289:
---

Assignee: nadine  (was: Shiyan Xu)

> Fix parameters for Big Query Sync
> -
>
> Key: HUDI-7289
> URL: https://issues.apache.org/jira/browse/HUDI-7289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Aditya Goenka
>Assignee: nadine
>Priority: Minor
> Fix For: 0.15.0
>
>
> Revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/]
>  
> From a user - 
> Info about the {{hoodie.gcp.bigquery.sync.require_partition_filter}} config, 
> which is part of Hudi 0.14.1, is missing from 
> [here|https://hudi.apache.org/docs/gcp_bigquery].
> Additionally, info about {{hoodie.gcp.bigquery.sync.base_path}} is not very 
> clear, and even the example is hard to understand.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7383) CDC query failed due to dependency issue

2024-05-15 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-7383:

Fix Version/s: 0.15.0

> CDC query failed due to dependency issue
> 
>
> Key: HUDI-7383
> URL: https://issues.apache.org/jira/browse/HUDI-7383
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Shiyan Xu
>Priority: Blocker
> Fix For: 0.15.0
>
>
> {code:java}
> spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', 
> '20240205084624923', '20240205091637412');
> 24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID 
> 1515) (ip-10-0-117-21.us-west-2.compute.internal executor 3): 
> java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$
>     at 
> org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.<init>(HoodieCDCRDD.scala:237)
>     at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 21 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7383) CDC query failed due to dependency issue

2024-05-15 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-7383.
---
Resolution: Fixed

> CDC query failed due to dependency issue
> 
>
> Key: HUDI-7383
> URL: https://issues.apache.org/jira/browse/HUDI-7383
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query
>Affects Versions: 0.14.0, 0.14.1
>Reporter: Shiyan Xu
>Priority: Blocker
>
> {code:java}
> spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', 
> '20240205084624923', '20240205091637412');
> 24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID 
> 1515) (ip-10-0-117-21.us-west-2.compute.internal executor 3): 
> java.lang.NoClassDefFoundError: 
> org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$
>     at 
> org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.<init>(HoodieCDCRDD.scala:237)
>     at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>     at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 21 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar

2024-05-09 Thread Shiyan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845134#comment-17845134
 ] 

Shiyan Xu commented on HUDI-5616:
-

fixed [https://github.com/apache/hudi/pull/11184]

> Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
> 
>
> Key: HUDI-5616
> URL: https://issues.apache.org/jira/browse/HUDI-5616
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> There is a usability change in [this 
> PR|https://github.com/apache/hudi/pull/7702] that requires a new conf for 
> Spark users:
> --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
> There will be a performance hit (it was actually always there) if this is 
> not specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar

2024-05-09 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-5616.
---
Resolution: Fixed

> Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
> 
>
> Key: HUDI-5616
> URL: https://issues.apache.org/jira/browse/HUDI-5616
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Shiyan Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> There is a usability change in [this 
> PR|https://github.com/apache/hudi/pull/7702] that requires a new conf for 
> Spark users:
> --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar
> There will be a performance hit (it was actually always there) if this is 
> not specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-5382) hoodie.datasource.write.partitionpath.field is inconsistent in the document

2024-05-09 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu closed HUDI-5382.
---
Fix Version/s: (was: 0.15.0)
   Resolution: Not A Problem

> hoodie.datasource.write.partitionpath.field is inconsistent in the document
> ---
>
> Key: HUDI-5382
> URL: https://issues.apache.org/jira/browse/HUDI-5382
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: docs
>Reporter: Akira Ajisaka
>Assignee: Shiyan Xu
>Priority: Minor
>
> The Hudi document is inconsistent in 
> hoodie.datasource.write.partitionpath.field and it says both required and 
> optional.
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield]
>  says required
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-1]
>  says optional
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-2]
>  says required
>  * [https://hudi.apache.org/docs/writing_data] says required
> Now I'm thinking it's optional. If it's not set, a non-partitioned Hudi 
> table is created.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-5382) hoodie.datasource.write.partitionpath.field is inconsistent in the document

2024-05-09 Thread Shiyan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845133#comment-17845133
 ] 

Shiyan Xu commented on HUDI-5382:
-

already fixed in the latest docs 
[https://hudi.apache.org/docs/next/writing_data]

> hoodie.datasource.write.partitionpath.field is inconsistent in the document
> ---
>
> Key: HUDI-5382
> URL: https://issues.apache.org/jira/browse/HUDI-5382
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: docs
>Reporter: Akira Ajisaka
>Assignee: Shiyan Xu
>Priority: Minor
> Fix For: 0.15.0
>
>
> The Hudi document is inconsistent in 
> hoodie.datasource.write.partitionpath.field and it says both required and 
> optional.
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield]
>  says required
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-1]
>  says optional
>  * 
> [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-2]
>  says required
>  * [https://hudi.apache.org/docs/writing_data] says required
> Now I'm thinking it's optional. If it's not set, a non-partitioned Hudi 
> table is created.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5710) Load all partitions in advance for clean when MDT is enabled

2024-05-07 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu updated HUDI-5710:

Fix Version/s: 0.14.0

> Load all partitions in advance for clean when MDT is enabled
> 
>
> Key: HUDI-5710
> URL: https://issues.apache.org/jira/browse/HUDI-5710
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning, table-service
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5710) Load all partitions in advance for clean when MDT is enabled

2024-05-07 Thread Shiyan Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiyan Xu reassigned HUDI-5710:
---

Assignee: Yue Zhang

> Load all partitions in advance for clean when MDT is enabled
> 
>
> Key: HUDI-5710
> URL: https://issues.apache.org/jira/browse/HUDI-5710
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning, table-service
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)