[jira] [Created] (HUDI-7984) Spark SQL should allow type promotion without using schema on read
Shiyan Xu created HUDI-7984: --- Summary: Spark SQL should allow type promotion without using schema on read Key: HUDI-7984 URL: https://issues.apache.org/jira/browse/HUDI-7984 Project: Apache Hudi Issue Type: Bug Components: spark-sql Reporter: Shiyan Xu Fix For: 1.0.0 {code:java} CREATE TABLE hudi_table ( ts BIGINT, uuid STRING, rider INT, driver STRING, fare DOUBLE, city STRING ) USING HUDI PARTITIONED BY (city); alter table hudi_table alter column rider type BIGINT; ALTER TABLE CHANGE COLUMN is not supported for changing column 'rider' with type 'IntegerType' to 'rider' with type 'LongType' {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
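As a possible workaround sketch (the {{hoodie.schema.on.read.enable}} flag exists in Hudi, though this ticket's point is that widening INT to BIGINT should not require it): enabling schema on read for the session before the ALTER lets the promotion go through.

{code:sql}
-- Workaround sketch: enable schema on read for this session, then promote the column
set hoodie.schema.on.read.enable=true;
ALTER TABLE hudi_table ALTER COLUMN rider TYPE BIGINT;
{code}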
[jira] [Assigned] (HUDI-7960) Support more partitioner in Hudi Flink integration
[ https://issues.apache.org/jira/browse/HUDI-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu reassigned HUDI-7960: --- Assignee: Zhenqiu Huang > Support more partitioner in Hudi Flink integration > --- > > Key: HUDI-7960 > URL: https://issues.apache.org/jira/browse/HUDI-7960 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Zhenqiu Huang >Assignee: Zhenqiu Huang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s
[ https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-7929: Component/s: flink > Add Flink Hudi Example for K8s > -- > > Key: HUDI-7929 > URL: https://issues.apache.org/jira/browse/HUDI-7929 > Project: Apache Hudi > Issue Type: New Feature > Components: flink >Reporter: Zhenqiu Huang >Assignee: Zhenqiu Huang >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7929) Add Flink Hudi Example for K8s
[ https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-7929: Fix Version/s: 1.0.0 > Add Flink Hudi Example for K8s > -- > > Key: HUDI-7929 > URL: https://issues.apache.org/jira/browse/HUDI-7929 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Zhenqiu Huang >Assignee: Zhenqiu Huang >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7929) Add Flink Hudi Example for K8s
[ https://issues.apache.org/jira/browse/HUDI-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu reassigned HUDI-7929: --- Assignee: Zhenqiu Huang > Add Flink Hudi Example for K8s > -- > > Key: HUDI-7929 > URL: https://issues.apache.org/jira/browse/HUDI-7929 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Zhenqiu Huang >Assignee: Zhenqiu Huang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7376) Cancel running test instances in Azure CI when the PR is updated
[ https://issues.apache.org/jira/browse/HUDI-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-7376. --- Assignee: Shiyan Xu (was: Raymond Xu) Resolution: Cannot Reproduce > Cancel running test instances in Azure CI when the PR is updated > > > Key: HUDI-7376 > URL: https://issues.apache.org/jira/browse/HUDI-7376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lin Liu >Assignee: Shiyan Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-4705) Support Write-on-compaction mode when query cdc on MOR tables
[ https://issues.apache.org/jira/browse/HUDI-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853214#comment-17853214 ] Shiyan Xu commented on HUDI-4705: - [~lizhiqiang] [~biyan900...@gmail.com] to clarify, CDC for spark works on MOR, just that the implementation is using the write-on-indexing strategy (ref: [https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md#persisting-cdc-in-mor-write-on-indexing-vs-write-on-compaction]) We want to unify the implementation as write-on-compaction, which allows the flink writer to work too. (the write-on-indexing strategy does not work for flink, as explained in the RFC) > Support Write-on-compaction mode when querying cdc on MOR tables > - > > Key: HUDI-4705 > URL: https://issues.apache.org/jira/browse/HUDI-4705 > Project: Apache Hudi > Issue Type: New Feature > Components: compaction, spark, table-service >Reporter: Yann Byron >Priority: Major > > For the case of querying cdc on MOR tables, the initial implementation uses the > `Write-on-indexing` way to extract the cdc data by merging the base file and > log files in-flight. > This ticket wants to support the `Write-on-compaction` way to get the cdc > data just by reading the persisted cdc files which are written at the > compaction operation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
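To make the read path concrete, a minimal query sketch (the {{hudi_table_changes}} table-valued function and the 'cdc' mode appear elsewhere in this digest; the table name and instants are placeholders):

{code:sql}
-- Sketch: incremental query of CDC data between two commit instants
select * from hudi_table_changes('tbl', 'cdc', '<begin_instant>', '<end_instant>');
{code}

Under write-on-compaction, such a query would read cdc files persisted at compaction time rather than merging base and log files in-flight.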
[jira] [Closed] (HUDI-6633) Add hms based sync to hudi website
[ https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-6633. --- Resolution: Fixed > Add hms based sync to hudi website > -- > > Key: HUDI-6633 > URL: https://issues.apache.org/jira/browse/HUDI-6633 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: sivabalan narayanan >Assignee: Shiyan Xu >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > we should add hms based sync to our hive sync page > [https://hudi.apache.org/docs/syncing_metastore] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-4967: Status: Open (was: Patch Available) > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
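A minimal sketch of pinning the extractor explicitly, per the description above (the class name comes from the ticket; the session-level {{set}} form is an assumption about how a user would pass it):

{code:sql}
-- Sketch: explicitly pin the pre-0.12.0 default extractor for Hive sync
set hoodie.datasource.hive_sync.partition_value_extractor=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor;
{code}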
[jira] [Closed] (HUDI-4834) Update AWSGlueCatalog syncing page to add spark datasource example
[ https://issues.apache.org/jira/browse/HUDI-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-4834. --- Fix Version/s: 1.0.0 Resolution: Fixed > Update AWSGlueCatalog syncing page to add spark datasource example > -- > > Key: HUDI-4834 > URL: https://issues.apache.org/jira/browse/HUDI-4834 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Bhavani Sudha >Assignee: Shiyan Xu >Priority: Minor > Labels: documentation > Fix For: 0.15.0, 1.0.0 > > > [https://hudi.apache.org/docs/next/syncing_aws_glue_data_catalog] this page > specifically talks about how to leverage this syncing mechanism via > Deltastreamer. We also need an example for spark datasource here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-4967) Improve docs for meta sync with TimestampBasedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-4967. --- Fix Version/s: 1.0.0 Resolution: Fixed > Improve docs for meta sync with TimestampBasedKeyGenerator > -- > > Key: HUDI-4967 > URL: https://issues.apache.org/jira/browse/HUDI-4967 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Related fix: HUDI-4966 > We need to add docs on how to properly set the meta sync configuration, > especially the hoodie.datasource.hive_sync.partition_value_extractor, in > [https://hudi.apache.org/docs/key_generation] (for different Hudi versions, > the config can be different). Check the ticket above and PR description of > [https://github.com/apache/hudi/pull/6851] for more details. > We should also add the migration setup on the key generation page as well: > [https://hudi.apache.org/releases/release-0.12.0/#configuration-updates] > * {{{}hoodie.datasource.hive_sync.partition_value_extractor{}}}: This config > is used to extract and transform partition value during Hive sync. Its > default value has been changed from > {{SlashEncodedDayPartitionValueExtractor}} to > {{{}MultiPartKeysValueExtractor{}}}. If you relied on the previous default > value (i.e., have not set it explicitly), you are required to set the config > to {{{}org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor{}}}. From > this release, if this config is not set and Hive sync is enabled, then > partition value extractor class will be *automatically inferred* on the basis > of number of partition fields and whether or not hive style partitioning is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6230) Make hive sync aws support partition indexes
[ https://issues.apache.org/jira/browse/HUDI-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-6230: Fix Version/s: 0.15.0 > Make hive sync aws support partition indexes > > > Key: HUDI-6230 > URL: https://issues.apache.org/jira/browse/HUDI-6230 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Assignee: nicolas paris >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0, 1.0.0 > > > Glue provides indexing features that speed up partition retrieval a lot. > So far this is not supported. Having a new hive-sync configuration to activate > the feature, and optionally specify which partition columns to index, would > be helpful. > Also, this is an operation that should not be done at table creation time, but > could be activated/deactivated at will > > https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-1964. --- Resolution: Duplicate > Update guide around hive metastore and hive sync for hudi tables > > > Key: HUDI-1964 > URL: https://issues.apache.org/jira/browse/HUDI-1964 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Nishith Agarwal >Assignee: Shiyan Xu >Priority: Minor > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-1964) Update guide around hive metastore and hive sync for hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-1964: Fix Version/s: 1.0.0 > Update guide around hive metastore and hive sync for hudi tables > > > Key: HUDI-1964 > URL: https://issues.apache.org/jira/browse/HUDI-1964 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: Nishith Agarwal >Assignee: Shiyan Xu >Priority: Minor > Fix For: 0.15.0, 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6633) Add hms based sync to hudi website
[ https://issues.apache.org/jira/browse/HUDI-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-6633: Fix Version/s: 1.0.0 > Add hms based sync to hudi website > -- > > Key: HUDI-6633 > URL: https://issues.apache.org/jira/browse/HUDI-6633 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: sivabalan narayanan >Assignee: Shiyan Xu >Priority: Major > Fix For: 0.15.0, 1.0.0 > > > we should add hms based sync to our hive sync page > [https://hudi.apache.org/docs/syncing_metastore] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-851) Add Documentation on partitioning data with examples and details on how to sync to Hive
[ https://issues.apache.org/jira/browse/HUDI-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-851. -- Fix Version/s: 1.0.0 (was: 0.15.0) Resolution: Duplicate > Add Documentation on partitioning data with examples and details on how to > sync to Hive > --- > > Key: HUDI-851 > URL: https://issues.apache.org/jira/browse/HUDI-851 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Bhavani Sudha >Assignee: Shiyan Xu >Priority: Minor > Labels: query-eng, user-support-issues > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-7414. --- Resolution: Fixed > Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs > --- > > Key: HUDI-7414 > URL: https://issues.apache.org/jira/browse/HUDI-7414 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: nadine >Assignee: Shiyan Xu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > There was a jira issue filed where sarfaraz wanted to know more about the > `hoodie.gcp.bigquery.sync.base_path`. > In the BigQuerySyncConfig file, there is a config property set: > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103] > But it’s not used anywhere else in the big query code base. > However, I see > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124] > being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} > is superfluous. It is being set as a config, but not being used > anywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7414) Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs
[ https://issues.apache.org/jira/browse/HUDI-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-7414: Fix Version/s: 1.0.0 (was: 0.15.0) > Remove hoodie.gcp.bigquery.sync.base_path reference in the gcp docs > --- > > Key: HUDI-7414 > URL: https://issues.apache.org/jira/browse/HUDI-7414 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: nadine >Assignee: Shiyan Xu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > There was a jira issue filed where sarfaraz wanted to know more about the > `hoodie.gcp.bigquery.sync.base_path`. > In the BigQuerySyncConfig file, there is a config property set: > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L103] > But it’s not used anywhere else in the big query code base. > However, I see > [https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java#L124] > being used to get the base path. The {{hoodie.gcp.bigquery.sync.base_path}} > is superfluous. It is being set as a config, but not being used > anywhere. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7289) Fix parameters for Big Query Sync
[ https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-7289. --- Resolution: Fixed > Fix parameters for Big Query Sync > - > > Key: HUDI-7289 > URL: https://issues.apache.org/jira/browse/HUDI-7289 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Aditya Goenka >Assignee: nadine >Priority: Minor > Fix For: 0.15.0 > > > Revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/] > > From a user - > Info about {{hoodie.gcp.bigquery.sync.require_partition_filter}} config is > missing from [here|https://hudi.apache.org/docs/gcp_bigquery] which is part > of Hudi 0.14.1. > Additionally, info about {{hoodie.gcp.bigquery.sync.base_path}} is not very > clear, even the example is not understandable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
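For illustration, a hedged fragment of the two configs the user asked about (the option names come from the ticket; the values and the bucket path are made-up placeholders):

{code}
# Illustrative only -- names from the ticket, values are placeholders
hoodie.gcp.bigquery.sync.require_partition_filter=true
hoodie.gcp.bigquery.sync.base_path=gs://<bucket>/<table_base_path>
{code}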
[jira] [Assigned] (HUDI-7289) Fix parameters for Big Query Sync
[ https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu reassigned HUDI-7289: --- Assignee: nadine (was: Shiyan Xu) > Fix parameters for Big Query Sync > - > > Key: HUDI-7289 > URL: https://issues.apache.org/jira/browse/HUDI-7289 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Aditya Goenka >Assignee: nadine >Priority: Minor > Fix For: 0.15.0 > > > Revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/] > > From a user - > Info about {{hoodie.gcp.bigquery.sync.require_partition_filter}} config is > missing from [here|https://hudi.apache.org/docs/gcp_bigquery] which is part > of Hudi 0.14.1. > Additionally, info about {{hoodie.gcp.bigquery.sync.base_path}} is not very > clear, even the example is not understandable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7383) CDC query failed due to dependency issue
[ https://issues.apache.org/jira/browse/HUDI-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-7383: Fix Version/s: 0.15.0 > CDC query failed due to dependency issue > > > Key: HUDI-7383 > URL: https://issues.apache.org/jira/browse/HUDI-7383 > Project: Apache Hudi > Issue Type: Bug > Components: incremental-query >Affects Versions: 0.14.0, 0.14.1 >Reporter: Shiyan Xu >Priority: Blocker > Fix For: 0.15.0 > > > {code:java} > spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', > '20240205084624923', '20240205091637412'); > 24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID > 1515) (ip-10-0-117-21.us-west-2.compute.internal executor 3): > java.lang.NoClassDefFoundError: > org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ > at > org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.(HoodieCDCRDD.scala:237) > at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:141) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: java.lang.ClassNotFoundException: > org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$ > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > ... 21 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7383) CDC query failed due to dependency issue
[ https://issues.apache.org/jira/browse/HUDI-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-7383. --- Resolution: Fixed > CDC query failed due to dependency issue > > > Key: HUDI-7383 > URL: https://issues.apache.org/jira/browse/HUDI-7383 > Project: Apache Hudi > Issue Type: Bug > Components: incremental-query >Affects Versions: 0.14.0, 0.14.1 >Reporter: Shiyan Xu >Priority: Blocker > > {code:java} > spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', > '20240205084624923', '20240205091637412'); > 24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID > 1515) (ip-10-0-117-21.us-west-2.compute.internal executor 3): > java.lang.NoClassDefFoundError: > org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ > at > org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.(HoodieCDCRDD.scala:237) > at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > at org.apache.spark.scheduler.Task.run(Task.scala:141) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: java.lang.ClassNotFoundException: > org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$ > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > ... 21 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
[ https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845134#comment-17845134 ] Shiyan Xu commented on HUDI-5616: - fixed [https://github.com/apache/hudi/pull/11184] > Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar > > > Key: HUDI-5616 > URL: https://issues.apache.org/jira/browse/HUDI-5616 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > There is a usability change in [this > PR|https://github.com/apache/hudi/pull/7702] that requires a new conf for > spark users > --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar > There will be a hit on performance (it was actually always there) if this is > not specified. -- This message was sent by Atlassian Jira (v8.20.10#820010)
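A usage sketch of the conf at submit time (the registrator line comes from the ticket; the serializer line is the commonly paired Kryo setting, added here as an assumption, and the application file is a placeholder):

{code}
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  <your_hudi_app>.py
{code}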
[jira] [Closed] (HUDI-5616) Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar
[ https://issues.apache.org/jira/browse/HUDI-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-5616. --- Resolution: Fixed > Docs update for specifying org.apache.spark.HoodieSparkKryoRegistrar > > > Key: HUDI-5616 > URL: https://issues.apache.org/jira/browse/HUDI-5616 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Shiyan Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > There is a usability change in [this > PR|https://github.com/apache/hudi/pull/7702] that requires a new conf for > spark users > --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar > There will be a hit on performance (it was actually always there) if this is > not specified. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5382) hoodie.datasource.write.partitionpath.field is inconsistent in the document
[ https://issues.apache.org/jira/browse/HUDI-5382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu closed HUDI-5382. --- Fix Version/s: (was: 0.15.0) Resolution: Not A Problem > hoodie.datasource.write.partitionpath.field is inconsistent in the document > --- > > Key: HUDI-5382 > URL: https://issues.apache.org/jira/browse/HUDI-5382 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Akira Ajisaka >Assignee: Shiyan Xu >Priority: Minor > > The Hudi document is inconsistent in > hoodie.datasource.write.partitionpath.field and it says both required and > optional. > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield] > says required > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-1] > says optional > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-2] > says required > * [https://hudi.apache.org/docs/writing_data] says required > Now I'm thinking it's optional. If it's not set, non-partitioned Hudi table > is created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-5382) hoodie.datasource.write.partitionpath.field is inconsistent in the document
[ https://issues.apache.org/jira/browse/HUDI-5382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845133#comment-17845133 ] Shiyan Xu commented on HUDI-5382: - already fixed in the latest docs [https://hudi.apache.org/docs/next/writing_data] > hoodie.datasource.write.partitionpath.field is inconsistent in the document > --- > > Key: HUDI-5382 > URL: https://issues.apache.org/jira/browse/HUDI-5382 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Akira Ajisaka >Assignee: Shiyan Xu >Priority: Minor > Fix For: 0.15.0 > > > The Hudi document is inconsistent in > hoodie.datasource.write.partitionpath.field and it says both required and > optional. > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield] > says required > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-1] > says optional > * > [https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritepartitionpathfield-2] > says required > * [https://hudi.apache.org/docs/writing_data] says required > Now I'm thinking it's optional. If it's not set, non-partitioned Hudi table > is created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
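To illustrate the "optional" reading from the ticket (table and column names are placeholders): omitting the partition path field yields a non-partitioned table, matching the reporter's conclusion.

{code:sql}
-- Sketch: no partitionpath field and no PARTITIONED BY clause
-- -> a non-partitioned Hudi table is created
CREATE TABLE t_nonpartitioned (id INT, name STRING) USING HUDI;
{code}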
[jira] [Updated] (HUDI-5710) Load all partitions in advance for clean when MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu updated HUDI-5710: Fix Version/s: 0.14.0 > Load all partitions in advance for clean when MDT is enabled > > > Key: HUDI-5710 > URL: https://issues.apache.org/jira/browse/HUDI-5710 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning, table-service >Reporter: Yue Zhang >Assignee: Yue Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5710) Load all partitions in advance for clean when MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiyan Xu reassigned HUDI-5710: --- Assignee: Yue Zhang > Load all partitions in advance for clean when MDT is enabled > > > Key: HUDI-5710 > URL: https://issues.apache.org/jira/browse/HUDI-5710 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning, table-service >Reporter: Yue Zhang >Assignee: Yue Zhang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)