Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10304:
URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852320625

   
   ## CI report:
   
   * d858eaac14b3de45d4066165622738d91ff603fe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
 
   * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN
   * 7ce0e45df128a45407b2747a6d1004036e0d3ee8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10304:
URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852305331

   
   ## CI report:
   
   * d858eaac14b3de45d4066165622738d91ff603fe Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
 
   * 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer [hudi]

2023-12-12 Thread via GitHub


young138120 commented on issue #10320:
URL: https://github.com/apache/hudi/issues/10320#issuecomment-1852267384

   I have configured the value of the spark.serializer parameter.





[I] [SUPPORT] [hudi]

2023-12-12 Thread via GitHub


young138120 opened a new issue, #10320:
URL: https://github.com/apache/hudi/issues/10320

   **Describe the problem you faced**
   I run a Spark job to write data to Hudi, and init the Spark session like this:
   
![image](https://github.com/apache/hudi/assets/11519151/37f69790-5cbd-44b4-94be-f2613e71f179)
   I mocked some simple data and tried to write it:
   
![image](https://github.com/apache/hudi/assets/11519151/531e6fe9-bbcf-4a9e-a1c4-49087106e3b8)
   `entities` is a list of Java POJOs,
   but the write fails, and the exception confuses me:
   
![image](https://github.com/apache/hudi/assets/11519151/8ceca114-0a39-497c-be08-06d106871ad7)
   Why is this happening?
   
   **Environment Description**
   
   * Hudi version :
   0.9.0(huaweicloud)
   * Spark version :
   3.1.1
   * Hive version :
   3.1.0
   * Hadoop version :
   3.1.1
   * Storage (HDFS/S3/GCS..) :
   HDFS
   * Running on Docker? (yes/no) :
   NO
   
   **Stacktrace**
   2023-12-12 23:12:30,066 | INFO  | [Driver] | Exception - system error, message - hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer | com.jn.dwbi.spark.Launcher.main(Launcher.java:115)
   org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:91)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
   at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:133)
   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:132)
   at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:1020)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:108)
   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:170)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:91)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:780)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:1020)
   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:446)
   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
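   
   For reference, a minimal sketch (not from the issue) of building the session with the serializer that the check in HoodieSparkSqlWriter.scala:89 expects; the app name and master are hypothetical:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Minimal sketch, assuming the session is built before any SparkContext
   // exists: spark.serializer must be set at construction time, because Hudi
   // validates it against the already-running context's configuration.
   val spark = SparkSession.builder()
     .appName("hudi-write-demo") // hypothetical
     .master("yarn")             // hypothetical
     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
     .getOrCreate()
   ```
   
   Note that if a SparkSession was already created elsewhere without this setting, `getOrCreate()` returns that session, and core settings like `spark.serializer` cannot be changed after the context has started.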
   
   





Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10308:
URL: https://github.com/apache/hudi/pull/10308#issuecomment-1852079550

   
   ## CI report:
   
   * 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
 
   * 14d5465e2e85b66ff4404a5c9b46f19e9c9a0e73 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21472)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] [SUPPORT] Reuse table configuration between Spark Writes and HoodieStreamer [hudi]

2023-12-12 Thread via GitHub


baunz opened a new issue, #10319:
URL: https://github.com/apache/hudi/issues/10319

   **Describe the problem you faced**
   
   We are bootstrapping a MOR table with a Spark job using bulk insert, and 
periodically upsert data afterwards with HoodieStreamer. 
   
   Currently, it is not clear to me which properties can be reused via the 
same properties file and which have to be specified explicitly. It seems that 
all CLI options of HoodieStreamer need to be set, or otherwise the target 
table properties are overridden by the streamer's default properties. An example 
follows:
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   1. Write to table with a spark job with the following config
   
   ```
   
hoodie.compaction.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.payload.ordering.field=LAST_UPDATE
   ```
   => hoodie.properties contains this value
   
   2. Run Deltastreamer with the same config values passed as a props file
   
   => hoodie.properties contains
   
   ```
   
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   ```
   
   as it is the default value for the [cli 
argument](https://github.com/apache/hudi/blob/17b62a2c0f47f86b436330f2b0ea109b8c8f743c/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java#L259)
 
   
   **Expected behavior**
   
   A way to avoid having to specify properties twice (essentially all 
Streamer args) would reduce the chance of error, if I understand the main cause 
of the problem correctly; see the sketch after the stacktrace below.
   
   **Environment Description**
   
   * Hudi version :
   0.14.0
   EMR Serverless 6.15.0
   S3
   
   * Running on Docker? (yes/no) :
   no
   **Stacktrace**
   
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Config conflict(key  current value  existing value):
   hoodie.compaction.payload.class:  org.apache.hudi.common.model.DefaultHoodieRecordPayload  org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:211)
   at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:158)
   at org.apache.hudi.HoodieWriterUtils.validateTableConfig(HoodieWriterUtils.scala)
   at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.<init>(HoodieStreamer.java:683)
   at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:159)
   ```
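   
   For illustration, a minimal sketch (my assumption, not from the report) of the bootstrap write side; `df` and the target path are hypothetical, and the option keys mirror the props above:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // Sketch, assuming a DataFrame `df` to bulk-insert: the payload class and
   // ordering field written here are persisted into hoodie.properties, so a
   // later HoodieStreamer run must pass the same values (e.g. via
   // --payload-class), or its CLI defaults win and validateTableConfig
   // reports the conflict shown above.
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table") // hypothetical
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     .option("hoodie.compaction.payload.class",
       "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
     .option("hoodie.payload.ordering.field", "LAST_UPDATE")
     .mode(SaveMode.Append)
     .save("s3://bucket/path/to/table") // hypothetical
   ```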
   
   





Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10308:
URL: https://github.com/apache/hudi/pull/10308#issuecomment-1852064937

   
   ## CI report:
   
   * 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
 
   * 14d5465e2e85b66ff4404a5c9b46f19e9c9a0e73 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10317:
URL: https://github.com/apache/hudi/pull/10317#issuecomment-1852047364

   
   ## CI report:
   
   * d17847ad9ae0724c7e93fc3a8423ba069326541a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21469)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch asf-site updated: added link and command (#10293)

2023-12-12 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new d283192def4 added link and command (#10293)
d283192def4 is described below

commit d283192def43a7bc9009db877933def237fec1c2
Author: Sagar Lakshmipathy <18vidhyasa...@gmail.com>
AuthorDate: Tue Dec 12 05:33:14 2023 -0800

added link and command (#10293)
---
 website/docs/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.12.0/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.12.1/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.12.2/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.12.3/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.13.0/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.13.1/syncing_aws_glue_data_catalog.md   | 15 +++
 .../version-0.14.0/syncing_aws_glue_data_catalog.md   | 15 +++
 8 files changed, 120 insertions(+)

diff --git a/website/docs/syncing_aws_glue_data_catalog.md 
b/website/docs/syncing_aws_glue_data_catalog.md
index 3ab47deeab7..e54c6d52887 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+ Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, 
you can use the options mentioned in the
+[AWS 
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+ Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog 
as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: 
--database  --table  --partitioned-by 
+```
\ No newline at end of file
diff --git 
a/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md 
b/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+ Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, 
you can use the options mentioned in the
+[AWS 
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+ Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog 
as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: 
--database  --table  --partitioned-by 
+```
\ No newline at end of file
diff --git 
a/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md 
b/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+ Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, 
you can use the options mentioned in the
+[AWS 
documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+ Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog 
as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: 
--database  --table  --partitioned-by 
+```
\ No newline at end of file
diff --git 
a/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md 
b/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes 

Re: [PR] [MINOR][DOCS] Updates to Glue Catalog Sync page [hudi]

2023-12-12 Thread via GitHub


bhasudha merged PR #10293:
URL: https://github.com/apache/hudi/pull/10293





Re: [PR] [MINOR][DOCS] Updates to Glue Catalog Sync page [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10293:
URL: https://github.com/apache/hudi/pull/10293#discussion_r1423999115


##
website/docs/syncing_aws_glue_data_catalog.md:
##
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+ Running AWS Glue Catalog Sync for Spark DataSource

Review Comment:
   +1 Thanks for adding this.






Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on PR #10294:
URL: https://github.com/apache/hudi/pull/10294#issuecomment-1852036173

   Minor nit: please avoid IntelliJ-suggested or whitespace-only changes going 
forward, since these settings differ across individuals and they get in the way 
of review :)





Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423990484


##
website/docs/sql_queries.md:
##
@@ -362,37 +349,37 @@ Following tables show whether a given query is supported 
on specific query engin
 
 ### Copy-On-Write tables
 
-| Query Engine  |Snapshot Queries|Incremental Queries|
-|---||---|
-| **Hive**  |Y|Y|
-| **Spark SQL** |Y|Y|
-| **Flink SQL** |Y|N|
-| **PrestoDB**  |Y|N|
-| **Trino** |Y|N|
-| **AWS Athena**|Y|N|
-| **BigQuery**  |Y|N|
-| **Impala**|Y|N|
-| **Redshift Spectrum** |Y|N|
-| **Doris** |Y|N|
-| **StarRocks** |Y|N|
-| **ClickHouse**|Y|N|
+| Query Engine  | Snapshot Queries | Incremental Queries |
+|---|--|-|
+| **Hive**  | Y| Y   |
+| **Spark SQL** | Y| Y   |
+| **Flink SQL** | Y| N   |
+| **PrestoDB**  | Y| N   |
+| **Trino** | Y| N   |
+| **AWS Athena**| Y| N   |
+| **BigQuery**  | Y| N   |
+| **Impala**| Y| N   |
+| **Redshift Spectrum** | Y| N   |
+| **Doris** | Y| N   |
+| **StarRocks** | Y| Y   |

Review Comment:
   Are incremental queries supported in starrocks?






Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423989899


##
website/docs/sql_queries.md:
##
@@ -362,37 +349,37 @@ Following tables show whether a given query is supported 
on specific query engin
 
 ### Copy-On-Write tables
 
-| Query Engine  |Snapshot Queries|Incremental Queries|
-|---||---|
-| **Hive**  |Y|Y|
-| **Spark SQL** |Y|Y|
-| **Flink SQL** |Y|N|
-| **PrestoDB**  |Y|N|
-| **Trino** |Y|N|
-| **AWS Athena**|Y|N|
-| **BigQuery**  |Y|N|
-| **Impala**|Y|N|
-| **Redshift Spectrum** |Y|N|
-| **Doris** |Y|N|
-| **StarRocks** |Y|N|
-| **ClickHouse**|Y|N|
+| Query Engine  | Snapshot Queries | Incremental Queries |
+|---|--|-|
+| **Hive**  | Y| Y   |
+| **Spark SQL** | Y| Y   |
+| **Flink SQL** | Y| N   |
+| **PrestoDB**  | Y| N   |
+| **Trino** | Y| N   |
+| **AWS Athena**| Y| N   |
+| **BigQuery**  | Y| N   |
+| **Impala**| Y| N   |
+| **Redshift Spectrum** | Y| N   |
+| **Doris** | Y| N   |
+| **StarRocks** | Y| Y   |
+| **ClickHouse**| Y| N   |
 
 ### Merge-On-Read tables
 
-| Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
-|-||---|--|
-| **Hive**|Y|Y|Y|
-| **Spark SQL**   |Y|Y|Y|
-| **Spark Datasource** |Y|Y|Y|
-| **Flink SQL**   |Y|Y|Y|
-| **PrestoDB**|Y|N|Y|
-| **AWS Athena**  |Y|N|Y|
-| **Big Query**   |Y|N|Y|
-| **Trino**   |N|N|Y|
-| **Impala**  |N|N|Y|
-| **Redshift Spectrum** |N|N|N|
-| **Doris**   |N|N|N|
-| **StarRocks**   |N|N|N|
-| **ClickHouse**  |N|N|N|
+| Query Engine          | Snapshot Queries | Incremental Queries | Read Optimized Queries |
+|-----------------------|------------------|---------------------|------------------------|
+| **Hive**              | Y                | Y                   | Y                      |
+| **Spark SQL**         | Y                | Y                   | Y                      |
+| **Spark Datasource**  | Y                | Y                   | Y                      |
+| **Flink SQL**         | Y                | Y                   | Y                      |
+| **PrestoDB**          | Y                | N                   | Y                      |
+| **AWS Athena**        | Y                | N                   | Y                      |
+| **Big Query**         | Y                | N                   | Y                      |
+| **Trino**             | N                | N                   | Y                      |
+| **Impala**            | N                | N                   | Y                      |
+| **Redshift Spectrum** | N                | N                   | Y                      |
+| **Doris**             | N                | N                   | N                      |
+| **StarRocks**         | Y                | Y                   | Y                      |

Review Comment:
   Incremental queries are not supported, correct?






Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423987814


##
website/docs/sql_queries.md:
##
@@ -146,15 +142,11 @@ There are 3 use cases for incremental query:
the interval is a closed one: both start commit and end commit are 
inclusive;
 3. Time Travel: consume as batch for an instant time, specify the 
`read.end-commit` is enough because the start commit is latest by default.
 
-```sql

Review Comment:
   Please retain the code samples.






Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423989370


##
website/docs/sql_queries.md:
##
@@ -337,10 +326,8 @@ will be supported in the future.
 
 ## StarRocks
 
-Copy on Write tables in Apache Hudi 0.10.0 and above can be queried via 
StarRocks external tables from StarRocks version

Review Comment:
   The commit message suggests incremental queries are not supported. Can we 
clarify explicitly whether they are supported or not?






Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]

2023-12-12 Thread via GitHub


bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423987370


##
website/docs/sql_queries.md:
##
@@ -98,44 +98,40 @@ Once the Flink Hudi tables have been registered to the 
Flink catalog, they can b
 relying on the custom Hudi input formats like Hive. Typically, notebook users 
and Flink SQL CLI users leverage flink sql for querying Hudi tables. Please add 
hudi-flink-bundle as described in the [Flink 
Quickstart](/docs/flink-quick-start-guide).
 
 
-### Snapshot Query 
+### Snapshot Query
 By default, Flink SQL will try to use its optimized native readers (for e.g. 
reading parquet files) instead of Hive SerDes.
 Additionally, partition pruning is applied by Flink if a partition predicate 
is specified in the filter. Filters push down may not be supported yet (please 
check Flink roadmap).
 
-```sql
-select * from hudi_table/*+ OPTIONS('metadata.enabled'='true', 
'read.data.skipping.enabled'='false','hoodie.metadata.index.column.stats.enable'='true')*/;

Review Comment:
   Please retain the code samples. 



##
website/docs/sql_queries.md:
##
@@ -98,44 +98,40 @@ Once the Flink Hudi tables have been registered to the 
Flink catalog, they can b
 relying on the custom Hudi input formats like Hive. Typically, notebook users 
and Flink SQL CLI users leverage flink sql for querying Hudi tables. Please add 
hudi-flink-bundle as described in the [Flink 
Quickstart](/docs/flink-quick-start-guide).
 
 
-### Snapshot Query 
+### Snapshot Query
 By default, Flink SQL will try to use its optimized native readers (for e.g. 
reading parquet files) instead of Hive SerDes.
 Additionally, partition pruning is applied by Flink if a partition predicate 
is specified in the filter. Filters push down may not be supported yet (please 
check Flink roadmap).
 
-```sql
-select * from hudi_table/*+ OPTIONS('metadata.enabled'='true', 
'read.data.skipping.enabled'='false','hoodie.metadata.index.column.stats.enable'='true')*/;
-```
-
  Options
-|  Option Name  | Required | Default | Remarks |
-|  ---  | ---  | --- | --- |
-| `metadata.enabled` | `false` | false | Set to `true` to enable |
-| `read.data.skipping.enabled` | `false` | false | Whether to enable data skipping for batch snapshot read, by default disabled |
-| `hoodie.metadata.index.column.stats.enable` | `false` | false | Whether to enable column statistics (max/min) |
-| `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics  |
+| Option Name                                      | Required | Default | Remarks                                                                      |
+|--------------------------------------------------|----------|---------|------------------------------------------------------------------------------|
+| `metadata.enabled`                               | `false`  | false   | Set to `true` to enable                                                      |
+| `read.data.skipping.enabled`                     | `false`  | false   | Whether to enable data skipping for batch snapshot read, by default disabled |
+| `hoodie.metadata.index.column.stats.enable`      | `false`  | false   | Whether to enable column statistics (max/min)                                |
+| `hoodie.metadata.index.column.stats.column.list` | `false`  | N/A     | Columns(separated by comma) to collect the column statistics                 |
 
 ### Streaming Query
 By default, the hoodie table is read as batch, that is to read the latest 
snapshot data set and returns. Turns on the streaming read
 mode by setting option `read.streaming.enabled` as `true`. Sets up option 
`read.start-commit` to specify the read start offset, specifies the
 value as `earliest` if you want to consume all the history data set.
 
 ```sql
-select * from hudi_table/*+ OPTIONS('read.streaming.enabled'='true', 
'read.start-commit'='earliest')*/;

Review Comment:
   Please retain the code samples.






Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]

2023-12-12 Thread via GitHub


ad1happy2go commented on issue #9977:
URL: https://github.com/apache/hudi/issues/9977#issuecomment-1852010551

   Yes, they may be related. We missed backporting it to the 0.12.x minor 
releases. Does your original dataset also have more than 100 columns?





Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]

2023-12-12 Thread via GitHub


njalan commented on code in PR #10308:
URL: https://github.com/apache/hudi/pull/10308#discussion_r1423964727


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1092,6 +1092,10 @@ class HoodieSparkSqlWriterInternal {
   && mergedParams.getOrElse(DataSourceWriteOptions.TABLE_TYPE.key, 
COPY_ON_WRITE.name) == MERGE_ON_READ.name) {
   mergedParams.put(HoodieTableConfig.DROP_PARTITION_COLUMNS.key, "false")
 }
+// use meta sync database to fill hoodie.table.name if it not sets
+if (!mergedParams.contains(HoodieTableConfig.DATABASE_NAME.key()) && 
mergedParams.contains(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key())) {

Review Comment:
   @danny0405 Yes, I updated the comments just now.






Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10318:
URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851946870

   
   ## CI report:
   
   * 32e63551638725305e5b3318816aa4a469399796 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10313:
URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851861748

   
   ## CI report:
   
   * 5273d8cc9ed428d2ac6896f52664618ed02c98a1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21468)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]

2023-12-12 Thread via GitHub


voonhous commented on PR #10312:
URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851820623

   @danny0405 @cuibo01 I read through the JIRA ticket. While I understand how the 
state of the TM and JM can cause the potential data loss, I am still not very 
sure how the TM and JM reach that state.
   
   Can you please describe the Flink job that I can use to try and replicate 
this? 
   
   Thank you!





Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10318:
URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851799300

   
   ## CI report:
   
   * 32e63551638725305e5b3318816aa4a469399796 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21471)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10318:
URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851786461

   
   ## CI report:
   
   * 32e63551638725305e5b3318816aa4a469399796 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on issue #10315:
URL: https://github.com/apache/hudi/issues/10315#issuecomment-1851775346

   > but I want a config that can tell the source to only read the partitions in 
my configs, so I do not need to use a filter
   
   That does not follow the common intuition.
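   
   For reference, a minimal sketch (an assumption, not from the thread) of the filter-based approach; the table path and partition column are hypothetical:
   
   ```scala
   import org.apache.spark.sql.functions.col
   
   // Sketch, assuming an active SparkSession `spark`: stream-read the Hudi
   // table and keep only the wanted partitions with an ordinary filter,
   // rather than a dedicated source-side config.
   val stream = spark.readStream
     .format("hudi")
     .load("s3://bucket/path/to/table") // hypothetical
     .filter(col("part_col").isin("2023-12-11", "2023-12-12")) // hypothetical
   ```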





Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10317:
URL: https://github.com/apache/hudi/pull/10317#issuecomment-1851773322

   
   ## CI report:
   
   * d17847ad9ae0724c7e93fc3a8423ba069326541a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21469)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10313:
URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851773223

   
   ## CI report:
   
   * b9ebe136bdcafc4d5bbd407691f2420ccab45adc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21466)
 
   * 5273d8cc9ed428d2ac6896f52664618ed02c98a1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21468)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]

2023-12-12 Thread via GitHub


danny0405 commented on code in PR #10308:
URL: https://github.com/apache/hudi/pull/10308#discussion_r1423799754


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -1092,6 +1092,10 @@ class HoodieSparkSqlWriterInternal {
   && mergedParams.getOrElse(DataSourceWriteOptions.TABLE_TYPE.key, 
COPY_ON_WRITE.name) == MERGE_ON_READ.name) {
   mergedParams.put(HoodieTableConfig.DROP_PARTITION_COLUMNS.key, "false")
 }
+// use meta sync database to fill hoodie.table.name if it not sets
+if (!mergedParams.contains(HoodieTableConfig.DATABASE_NAME.key()) && 
mergedParams.contains(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key())) {

Review Comment:
   Are you saying `hoodie.database.name` ?






[jira] [Closed] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7132.

Fix Version/s: 0.14.1
   Resolution: Fixed

Fixed via master branch: 17b62a2c0f47f86b436330f2b0ea109b8c8f743c

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before this line of code runs, eventBuffer may be updated by `subtaskFailed`, and 
> some elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]

2023-12-12 Thread via GitHub


danny0405 merged PR #10312:
URL: https://github.com/apache/hudi/pull/10312





(hudi) branch master updated (cacbb82254c -> 17b62a2c0f4)

2023-12-12 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from cacbb82254c [HUDI-6658] Inject filters for incremental query  (#10225)
 add 17b62a2c0f4 [HUDI-7132] Data may be lost for flink task failure 
(#10312)

No new revisions were added by this update.

Summary of changes:
 .../hudi/sink/StreamWriteOperatorCoordinator.java  |  7 +++---
 .../sink/TestStreamWriteOperatorCoordinator.java   | 29 ++
 2 files changed, 32 insertions(+), 4 deletions(-)



[jira] [Updated] (HUDI-7131) The requested schema is not compatible with the file schema

2023-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7131:
-
Labels: core merge pull-request-available spark  (was: core merge spark)

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: loukey_j
>Priority: Critical
>  Labels: core, merge, pull-request-available, spark
> Fix For: 0.14.1
>
>
> When using a global index and a record's partition changes, it reports an error: The 
> requested schema is not compatible with the file schema...
> Why not use the schema from 
> org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal 
> to read Hudi data?
>  
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING 
> HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
> select * from hudi_ut_time_traval;
> |_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |version|name |birthDate |inc_day |
> |20231122100234339 |20231122100234339_0_0|1 |inc_day=2023-10-01 |8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet|1 |1 |str_1|2023-01-01 12:12:12|2023-10-01|
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' 
> as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested 
> schema is not compatible with the file schema. incompatible types: required 
> int32 id != optional int32 id
> at 
> org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at 
> org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at 
> org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
> parquet schema:
> {
> "type" : "record",
> "name" : "hudi_ut_time_traval_record",
> "namespace" : "hoodie.hudi_ut_time_traval",
> "fields" : [ {
> "name" : "_hoodie_commit_time",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_commit_seqno",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_record_key",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_partition_path",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "_hoodie_file_name",
> "type" : [ "null", "string" ],
> "doc" : "",
> "default" : null
> }, {
> "name" : "id",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "version",
> "type" : [ "null", "int" ],
> "default" : null
> }, {
> "name" : "name",
> "type" : [ "null", "string" ],
> "default" : null
> }, {
> "name" : "birthDate",
> "type" : [ "null", {
> "type" : "long",
> "logicalType" : "timestamp-micros"
> } ],
> "default" : null
> }, {
> "name" : "inc_day",
> "type" : [ "null", "string" ],
> "default" : null
> } ]
> }
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema:
> 
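
A minimal sketch (an assumption, not from the ticket) of resolving the table-level Avro schema as the reporter suggests; the Hadoop configuration and base path are hypothetical:

```scala
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}

// Sketch: resolve the table's Avro schema from the meta client and use it,
// rather than the requested schema, when reading existing base files.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(hadoopConf)               // hypothetical Hadoop Configuration
  .setBasePath("s3://bucket/table")  // hypothetical
  .build()
val tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema
```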

[PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]

2023-12-12 Thread via GitHub


nsivabalan opened a new pull request, #10318:
URL: https://github.com/apache/hudi/pull/10318

   ### Change Logs
   
   Fixing schema used to read base file in HoodieMergedReadHandle
   
   ### Impact
   
   MERGE INTO (MIT) works for global index use-cases. 
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10317:
URL: https://github.com/apache/hudi/pull/10317#issuecomment-1851697428

   
   ## CI report:
   
   * d17847ad9ae0724c7e93fc3a8423ba069326541a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling

2023-12-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7225:
-
Labels: pull-request-available  (was: )

> Correcting spelling errors or annotations with non-standard spelling
> 
>
> Key: HUDI-7225
> URL: https://issues.apache.org/jira/browse/HUDI-7225
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: mazhengxuan
>Priority: Minor
>  Labels: pull-request-available
>
> Modify some spelling errors or non-standard spelling comments pointed out by 
> Typo





[PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]

2023-12-12 Thread via GitHub


LeshracTheMalicious opened a new pull request, #10317:
URL: https://github.com/apache/hudi/pull/10317

   …rd spelling
   
   ### Change Logs
   
   Modify some spelling errors and non-standard spelling in comments pointed out 
by Typo
   
   ### Impact
   
   Theoretically no impact
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling

2023-12-12 Thread mazhengxuan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mazhengxuan updated HUDI-7225:
--
Description: Modify some spelling errors or non-standard spelling comments 
pointed out by Typo  (was: Revise some comments pointed out by Typo that are 
misspelled)

> Correcting spelling errors or annotations with non-standard spelling
> 
>
> Key: HUDI-7225
> URL: https://issues.apache.org/jira/browse/HUDI-7225
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: mazhengxuan
>Priority: Minor
>
> Modify some spelling errors or non-standard spelling comments pointed out by 
> Typo





Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10314:
URL: https://github.com/apache/hudi/pull/10314#issuecomment-1851635963

   
   ## CI report:
   
   * 88b9f8d9518f5afd376479ba9c87a8dd30170ffc Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10312:
URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851635788

   
   ## CI report:
   
   * 5c971e1a0cafb635ad9cfed0f452751314bdb21c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21465)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]

2023-12-12 Thread via GitHub


hudi-bot commented on PR #10308:
URL: https://github.com/apache/hudi/pull/10308#issuecomment-1851635617

   
   ## CI report:
   
   * 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling

2023-12-12 Thread mazhengxuan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mazhengxuan updated HUDI-7225:
--
Summary: Correcting spelling errors or annotations with non-standard 
spelling  (was: Correcting comments with incorrect spelling)

> Correcting spelling errors or annotations with non-standard spelling
> 
>
> Key: HUDI-7225
> URL: https://issues.apache.org/jira/browse/HUDI-7225
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: mazhengxuan
>Priority: Minor
>
> Revise some comments pointed out by Typo that are misspelled





Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]

2023-12-12 Thread via GitHub


prathit06 commented on code in PR #10313:
URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java:
##
@@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration 
conf, String field
 
   public static void addProjectionField(Configuration conf, String[] 
fieldName) {
 if (fieldName.length > 0) {
-  List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+  List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList());
   Arrays.stream(fieldName).forEach(field -> {

Review Comment:
   
   `LIST_COLUMNS`
   - It will be used when a columns list is passed in the Job Configuration.
   - It won't be used when the Configuration is created with empty params, such 
as `val jobConf = new JobConf()` (which is what we currently do in our Flink job 
to read a hoodie table); in that case, invoking 
`conf.get(serdeConstants.LIST_COLUMNS)` returns null and causes an NPE, which 
this particular fix handles.
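   
   A standalone sketch (not from the PR) of the failure mode, assuming Hive's `serdeConstants.LIST_COLUMNS` resolves to the key `"columns"`:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   
   // Sketch: an empty Configuration, as when a Flink job starts from
   // `new JobConf()` with no column metadata set.
   val conf = new Configuration(false)
   
   // conf.get("columns") returns null here, so calling .split(",") on the
   // result throws a NullPointerException -- the bug this PR fixes.
   // The two-argument overload substitutes a default value instead:
   val columns = conf.get("columns", "").split(",").toList // List("")
   ```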






Re: [I] [SUPPORT] how to config hudi table TTL in S3? The table_meta can be separated into a directory? [hudi]

2023-12-12 Thread via GitHub


zyclove commented on issue #10316:
URL: https://github.com/apache/hudi/issues/10316#issuecomment-1851604695

   > @zyclove Dont think if there is a way to point the different directory 
outside table directory OR having any such TTL configuration.
   
   Why can't we consider storing metadata and data files independently? The data 
TTL could then be more flexible and convenient. Could this be raised and 
submitted in subsequent planning meetings? Thanks





Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]

2023-12-12 Thread via GitHub


prathit06 commented on code in PR #10313:
URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java:
##
@@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration 
conf, String field
 
   public static void addProjectionField(Configuration conf, String[] 
fieldName) {
 if (fieldName.length > 0) {
-  List<String> columnNameList = 
Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+  List<String> columnNameList = 
Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, 
"").split(",")).collect(Collectors.toList());
   Arrays.stream(fieldName).forEach(field -> {

Review Comment:
   `LIST_COLUMNS`
   - will be used when the columns list is passed in the Job Configuration.
   - won't be used when the Configuration is created with empty params, such 
as `val jobConf = new JobConf()` (this is what we currently do in our Flink 
job to read a Hudi table); in that case `conf.get(serdeConstants.LIST_COLUMNS)` 
returns null and calling `.split(",")` on it throws an NPE, so this particular 
fix handles such cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]

2023-12-12 Thread via GitHub


cuibo01 commented on PR #10312:
URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851568190

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7170) Implement HFile reader independent of HBase

2023-12-12 Thread Bo Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Cui reassigned HUDI-7170:


Assignee: Bo Cui  (was: Ethan Guo)

> Implement HFile reader independent of HBase
> ---
>
> Key: HUDI-7170
> URL: https://issues.apache.org/jira/browse/HUDI-7170
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Bo Cui
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We'd like to provide our own implementation of the HFile reader which does not use 
> HBase dependencies. In the long term, we should also decouple the HFile 
> reader from Hadoop FileSystem abstractions.
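
As a rough illustration of what such a decoupled reader could look like, a 
hypothetical minimal surface is sketched below; every name in it is invented 
for illustration and is not Hudi's actual API:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.Optional;

// Hypothetical minimal surface for an HBase-free HFile reader; all names are
// invented for this sketch and do not reflect Hudi's eventual implementation.
public interface HFileReaderSketch extends Closeable {

  // Number of key-value entries in the file, as recorded in the trailer.
  long getNumEntries() throws IOException;

  // Point lookup: value bytes for an exact key match, if present.
  Optional<byte[]> get(byte[] key) throws IOException;

  // Forward scan over entries whose key is >= startKey (lexicographic order).
  Iterator<KeyValueEntry> scan(byte[] startKey) throws IOException;

  // A single key-value pair decoded from an HFile data block.
  final class KeyValueEntry {
    public final byte[] key;
    public final byte[] value;

    public KeyValueEntry(byte[] key, byte[] value) {
      this.key = key;
      this.value = value;
    }
  }
}
```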



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]

2023-12-12 Thread via GitHub


prathit06 commented on code in PR #10313:
URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java:
##
@@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration 
conf, String field
 
   public static void addProjectionField(Configuration conf, String[] 
fieldName) {
 if (fieldName.length > 0) {
-  List<String> columnNameList = 
Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+  List<String> columnNameList = 
Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, 
"").split(",")).collect(Collectors.toList());
   Arrays.stream(fieldName).forEach(field -> {

Review Comment:
   `LIST_COLUMNS` will be used when the columns list is passed in the Job 
Configuration; it won't be used when the Configuration is created with empty 
params, such as `val jobConf = new JobConf()` (this is what we currently do in 
our Flink job to read a Hudi table). In that case 
`conf.get(serdeConstants.LIST_COLUMNS)` returns null and calling `.split(",")` 
on it throws an NPE, so this particular fix handles such cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Bo Cui (Jira)


[ https://issues.apache.org/jira/browse/HUDI-7132 ]


Bo Cui deleted comment on HUDI-7132:
--

was (Author: bo cui):
From the code, this PR ([https://github.com/apache/hudi/pull/9867/files]) 
fixes the logic during initialization, 
but it doesn't fix the logic when a subtask fails, as in the logic shown 
below. Is my understanding correct?
!screenshot-1.png|width=750,height=379!

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before the above line of code executes, eventBuffer may be updated by `subtaskFailed`, and some 
> elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21
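
To make the reported race concrete, here is a hedged illustration of the null 
check the report implies; `WriteMetadataEvent` and the buffer layout are 
simplified stand-ins for the coordinator's actual state:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for StreamWriteOperatorCoordinator's event buffer: one
// slot per subtask, where subtaskFailed(...) may reset a slot to null while
// the commit path is iterating over the buffer.
public class EventBufferGuardDemo {

  static final class WriteMetadataEvent {
    final int subtaskId;

    WriteMetadataEvent(int subtaskId) {
      this.subtaskId = subtaskId;
    }
  }

  public static void main(String[] args) {
    WriteMetadataEvent[] eventBuffer = {
        new WriteMetadataEvent(0), null, new WriteMetadataEvent(2)
    };

    // Iterating without a null check would throw an NPE (or commit incomplete
    // metadata) once a failed subtask has cleared its slot.
    List<Integer> readySubtasks = new ArrayList<>();
    for (WriteMetadataEvent event : eventBuffer) {
      if (event == null) {
        continue; // a failed subtask cleared this slot; its retry must be awaited
      }
      readySubtasks.add(event.subtaskId);
    }
    System.out.println("subtasks ready for commit: " + readySubtasks);
  }
}
```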



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Bo Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Cui updated HUDI-7132:
-
Attachment: (was: screenshot-1.png)

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before the above line of code executes, eventBuffer may be updated by `subtaskFailed`, and some 
> elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Bo Cui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795645#comment-17795645
 ] 

Bo Cui commented on HUDI-7132:
--

From the code, this PR (https://github.com/apache/hudi/pull/9867/files) fixes 
the logic during initialization, 
but it doesn't fix the logic when a subtask fails, as in the logic shown 
below. Is my understanding correct?
 !screenshot-1.png! 

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before the above line of code executes, eventBuffer may be updated by `subtaskFailed`, and some 
> elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Bo Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Cui updated HUDI-7132:
-
Attachment: screenshot-1.png

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before the above line of code executes, eventBuffer may be updated by `subtaskFailed`, and some 
> elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HUDI-7132) Data may be lost in Flink checkpoint

2023-12-12 Thread Bo Cui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795645#comment-17795645
 ] 

Bo Cui edited comment on HUDI-7132 at 12/12/23 8:50 AM:


From the code, this PR ([https://github.com/apache/hudi/pull/9867/files]) 
fixes the logic during initialization, 
but it doesn't fix the logic when a subtask fails, as in the logic shown 
below. Is my understanding correct?
!screenshot-1.png|width=750,height=379!


was (Author: bo cui):
From the code, this PR (https://github.com/apache/hudi/pull/9867/files) fixes 
the logic during initialization, 
but it doesn't fix the logic when a subtask fails, as in the logic shown 
below. Is my understanding correct?
 !screenshot-1.png! 

> Data may be lost in Flink checkpoint
> 
>
> Key: HUDI-7132
> URL: https://issues.apache.org/jira/browse/HUDI-7132
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.13.1, 0.14.0
>Reporter: Bo Cui
>Priority: Major
>  Labels: pull-request-available
> Attachments: screenshot-1.png
>
>
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35
> Before the above line of code executes, eventBuffer may be updated by `subtaskFailed`, and some 
> elements of eventBuffer may be null
> https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6979][RFC-76] support event time based compaction strategy [hudi]

2023-12-12 Thread via GitHub


waitingF commented on code in PR #10266:
URL: https://github.com/apache/hudi/pull/10266#discussion_r1423651556


##
rfc/rfc-76/rfc-76.md:
##
@@ -0,0 +1,238 @@
+
+# RFC-[76]: [support EventTimeBasedCompactionStrategy]
+
+## Proposers
+
+- @waitingF
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Currently, to gain low ingestion latency, we can adopt the MergeOnRead table, 
which supports appending log files and 
+compacting them into the base file later. When querying the snapshot table (RT 
table) generated by MOR, 
+the query side has to merge log files with base files on the fly to get all 
data, which is time-consuming and adds query latency. 
+For this, Hudi provides the read-optimized table (RO table) for low query 
latency, just like COW.
+
+But currently there is no compaction strategy based on event time, so there 
is no data freshness guarantee for the RO table.
+For cases where users want all data before a specified time, they have to 
query the RT table, with the expected high query latency.

Review Comment:
   sure, will do
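
To ground the abstract quoted above, here is a sketch of the selection rule an 
event-time-based strategy could apply; the types are invented stand-ins, not 
Hudi's actual compaction strategy API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Invented stand-in types sketching the event-time selection rule; a real
// implementation would plug into Hudi's compaction strategy machinery instead.
public class EventTimeCompactionSketch {

  static final class FileGroupCandidate {
    final String fileGroupId;
    final long minLogEventTimeMs; // earliest event time in the group's uncompacted logs

    FileGroupCandidate(String fileGroupId, long minLogEventTimeMs) {
      this.fileGroupId = fileGroupId;
      this.minLogEventTimeMs = minLogEventTimeMs;
    }
  }

  // Select file groups holding log data at or below the watermark, so that
  // after compaction the RO view is complete up to that event time.
  static List<FileGroupCandidate> selectForCompaction(
      List<FileGroupCandidate> candidates, long eventTimeWatermarkMs) {
    return candidates.stream()
        .filter(c -> c.minLogEventTimeMs <= eventTimeWatermarkMs)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileGroupCandidate> candidates = Arrays.asList(
        new FileGroupCandidate("fg-1", 1_000L),
        new FileGroupCandidate("fg-2", 5_000L));
    // Only fg-1 has log data older than the 2s watermark, so only it is picked.
    selectForCompaction(candidates, 2_000L)
        .forEach(c -> System.out.println("compact " + c.fileGroupId));
  }
}
```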



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7225) Correcting comments with incorrect spelling

2023-12-12 Thread mazhengxuan (Jira)
mazhengxuan created HUDI-7225:
-

 Summary: Correcting comments with incorrect spelling
 Key: HUDI-7225
 URL: https://issues.apache.org/jira/browse/HUDI-7225
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: mazhengxuan


Revise some comments that the Typo inspection flagged as misspelled



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] how to config hudi table TTL in S3? The table_meta can be separated into a directory? [hudi]

2023-12-12 Thread via GitHub


ad1happy2go commented on issue #10316:
URL: https://github.com/apache/hudi/issues/10316#issuecomment-1851532435

   @zyclove I don't think there is a way to point to a different directory 
outside the table directory, or any such TTL configuration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


