Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304:
URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852320625

## CI report:

* d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
* 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN
* 7ce0e45df128a45407b2747a6d1004036e0d3ee8 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7215] Delete NewHoodieParquetFileFormat [hudi]
hudi-bot commented on PR #10304:
URL: https://github.com/apache/hudi/pull/10304#issuecomment-1852305331

## CI report:

* d858eaac14b3de45d4066165622738d91ff603fe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21456)
* 79c9af943ac8928216e8245752517f20893b0b42 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer [hudi]
young138120 commented on issue #10320:
URL: https://github.com/apache/hudi/issues/10320#issuecomment-1852267384

I have already configured the value of the spark.serializer parameter.
[I] [SUPPORT] hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer [hudi]
young138120 opened a new issue, #10320:
URL: https://github.com/apache/hudi/issues/10320

**Describe the problem you faced**

I run a Spark job to write data to Hudi, and initialize the Spark session like this:

![image](https://github.com/apache/hudi/assets/11519151/37f69790-5cbd-44b4-94be-f2613e71f179)

I mock some simple data and try to write it:

![image](https://github.com/apache/hudi/assets/11519151/531e6fe9-bbcf-4a9e-a1c4-49087106e3b8)

`entities` is a list of Java POJOs, but the write fails. I am confused because the exception makes no sense to me:

![image](https://github.com/apache/hudi/assets/11519151/8ceca114-0a39-497c-be08-06d106871ad7)

Why is this happening?

**Environment Description**

* Hudi version : 0.9.0 (Huawei Cloud)
* Spark version : 3.1.1
* Hive version : 3.1.0
* Hadoop version : 3.1.1
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Stacktrace**

```
2023-12-12 23:12:30,066 | INFO | [Driver] | Exception - system error, message - hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer | com.jn.dwbi.spark.Launcher.main(Launcher.java:115)
org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:89)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:91)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:133)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:132)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:1020)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:108)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:170)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:91)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:780)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:1020)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:446)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
```
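[Editor's note] The check that throws here expects the session to have been created with Kryo serialization. A minimal sketch of the usual fix, set at submit time (illustrative config fragment; the remaining submit arguments are elided, and the same setting can equivalently be passed via `SparkSession.builder().config(...)`):

```shell
# Required by Hudi's Spark writer, per the exception above:
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  ...
```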
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
hudi-bot commented on PR #10308:
URL: https://github.com/apache/hudi/pull/10308#issuecomment-1852079550

## CI report:

* 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
* 14d5465e2e85b66ff4404a5c9b46f19e9c9a0e73 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21472)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[I] [SUPPORT] Reuse table configuration between Spark Writes and HoodieStreamer [hudi]
baunz opened a new issue, #10319:
URL: https://github.com/apache/hudi/issues/10319

**Describe the problem you faced**

We are bootstrapping a MOR table with a Spark job using bulk insert, and afterwards periodically upsert data with HoodieStreamer. Currently, it is not clear to me which properties can be reused by passing the same properties file and which have to be specified explicitly. It seems that all CLI options of HoodieStreamer need to be set, or otherwise the target table properties are overridden by the streamer's default properties. Example follows.

**To Reproduce**

Steps to reproduce the behavior:

1. Write to a table with a Spark job using the following config:
```
hoodie.compaction.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
hoodie.payload.ordering.field=LAST_UPDATE
```
=> hoodie.properties contains this value.
2. Run Deltastreamer with the same config values passed as a props file => hoodie.properties contains
```
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
```
as that is the default value for the [cli argument](https://github.com/apache/hudi/blob/17b62a2c0f47f86b436330f2b0ea109b8c8f743c/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java#L259).

**Expected behavior**

Maybe a way to avoid specifying properties twice (essentially all streamer args), to reduce the probability of errors — if I understood the main cause of the problem correctly.

**Environment Description**

* Hudi version : 0.14.0, EMR Serverless 6.15.0, S3
* Running on Docker? (yes/no) : no

**Stacktrace**

```
Exception in thread "main" org.apache.hudi.exception.HoodieException: Config conflict(key	current value	existing value):
hoodie.compaction.payload.class:	org.apache.hudi.common.model.DefaultHoodieRecordPayload	org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
	at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:211)
	at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:158)
	at org.apache.hudi.HoodieWriterUtils.validateTableConfig(HoodieWriterUtils.scala)
	at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.<init>(HoodieStreamer.java:683)
	at org.apache.hudi.utilities.streamer.HoodieStreamer.<init>(HoodieStreamer.java:159)
```
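[Editor's note] The override described above can be sketched with a toy merge. This is illustrative Python, not Hudi's actual code, and the function and argument names are hypothetical: the point is that when a CLI layer with built-in defaults is merged after the props file, an option the user never typed still replaces the table's configured value.

```python
# Toy illustration of the config conflict: a CLI parser supplies a default
# payload class, and that default wins because the CLI layer is merged last.
DEFAULT_PAYLOAD = "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"

def effective_config(props_file: dict, cli_args: dict) -> dict:
    """Merge props-file values first, then CLI values (which are always set,
    because the parser fills in a default even when the flag is omitted)."""
    cli = {
        "hoodie.compaction.payload.class":
            cli_args.get("payload_class", DEFAULT_PAYLOAD),
    }
    merged = dict(props_file)
    merged.update(cli)  # CLI layer overrides the props file
    return merged

props = {"hoodie.compaction.payload.class":
         "org.apache.hudi.common.model.DefaultHoodieRecordPayload"}
# The user did not pass --payload-class, yet the parser default overrides the props file:
print(effective_config(props, {})["hoodie.compaction.payload.class"])
# -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
```

On the next write, table-config validation compares this merged value against what hoodie.properties already recorded, which is exactly the `Config conflict` in the stack trace above.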
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
hudi-bot commented on PR #10308:
URL: https://github.com/apache/hudi/pull/10308#issuecomment-1852064937

## CI report:

* 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
* 14d5465e2e85b66ff4404a5c9b46f19e9c9a0e73 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]
hudi-bot commented on PR #10317:
URL: https://github.com/apache/hudi/pull/10317#issuecomment-1852047364

## CI report:

* d17847ad9ae0724c7e93fc3a8423ba069326541a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21469)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
(hudi) branch asf-site updated: added link and command (#10293)
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new d283192def4  added link and command (#10293)
d283192def4 is described below

commit d283192def43a7bc9009db877933def237fec1c2
Author: Sagar Lakshmipathy <18vidhyasa...@gmail.com>
AuthorDate: Tue Dec 12 05:33:14 2023 -0800

    added link and command (#10293)
---
 website/docs/syncing_aws_glue_data_catalog.md       | 15 +++
 .../version-0.12.0/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.12.1/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.12.2/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.12.3/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.13.0/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.13.1/syncing_aws_glue_data_catalog.md | 15 +++
 .../version-0.14.0/syncing_aws_glue_data_catalog.md | 15 +++
 8 files changed, 120 insertions(+)

diff --git a/website/docs/syncing_aws_glue_data_catalog.md b/website/docs/syncing_aws_glue_data_catalog.md
index 3ab47deeab7..e54c6d52887 100644
--- a/website/docs/syncing_aws_glue_data_catalog.md
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+## Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, you can use the options mentioned in the
+[AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+## Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: --database --table --partitioned-by
+```
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md b/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.0/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+## Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, you can use the options mentioned in the
+[AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+## Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: --database --table --partitioned-by
+```
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md b/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.1/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+## Running AWS Glue Catalog Sync for Spark DataSource
+
+To write a Hudi table to Amazon S3 and catalog it in AWS Glue Data Catalog, you can use the options mentioned in the
+[AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html#aws-glue-programming-etl-format-hudi-write)
+
+## Running AWS Glue Catalog Sync from EMR
+
+If you're running HiveSyncTool on an EMR cluster backed by Glue Data Catalog as external metastore, you can simply run the sync from command line like below:
+
+```shell
+cd /usr/lib/hudi/bin
+
+./run_sync_tool.sh --base-path s3: --database --table --partitioned-by
+```
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md b/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
index 0d9075993ec..1228c0b21c4 100644
--- a/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
+++ b/website/versioned_docs/version-0.12.2/syncing_aws_glue_data_catalog.md
@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes
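[Editor's note] For the Spark DataSource route the commit describes, the sync is driven by write options rather than the standalone tool. A hedged sketch of the commonly used hive-sync keys for a Glue-backed catalog (values are placeholders; the AWS documentation linked in the commit is authoritative):

```properties
# Illustrative Hudi hive-sync options for a Glue-backed metastore.
# Database, table, and partition values below are placeholders.
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.database=my_database
hoodie.datasource.hive_sync.table=my_table
hoodie.datasource.hive_sync.partition_fields=my_partition_col
```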
Re: [PR] [MINOR][DOCS] Updates to Glue Catalog Sync page [hudi]
bhasudha merged PR #10293:
URL: https://github.com/apache/hudi/pull/10293
Re: [PR] [MINOR][DOCS] Updates to Glue Catalog Sync page [hudi]
bhasudha commented on code in PR #10293:
URL: https://github.com/apache/hudi/pull/10293#discussion_r1423999115

## website/docs/syncing_aws_glue_data_catalog.md:

@@ -16,3 +16,18 @@ be passed along.
 ```shell
 --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 ```
+
+## Running AWS Glue Catalog Sync for Spark DataSource

Review Comment:
   +1 Thanks for adding this.
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on PR #10294:
URL: https://github.com/apache/hudi/pull/10294#issuecomment-1852036173

Minor nit: please avoid IDE-suggested or whitespace-only changes going forward, since these can differ across individual contributors' settings and get in the way of review :)
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423990484

## website/docs/sql_queries.md:

@@ -362,37 +349,37 @@ Following tables show whether a given query is supported on specific query engines

 ### Copy-On-Write tables

-| Query Engine |Snapshot Queries|Incremental Queries|
-|---|---|---|
-| **Hive** |Y|Y|
-| **Spark SQL** |Y|Y|
-| **Flink SQL** |Y|N|
-| **PrestoDB** |Y|N|
-| **Trino** |Y|N|
-| **AWS Athena**|Y|N|
-| **BigQuery** |Y|N|
-| **Impala**|Y|N|
-| **Redshift Spectrum** |Y|N|
-| **Doris** |Y|N|
-| **StarRocks** |Y|N|
-| **ClickHouse**|Y|N|
+| Query Engine          | Snapshot Queries | Incremental Queries |
+|-----------------------|------------------|---------------------|
+| **Hive**              | Y                | Y                   |
+| **Spark SQL**         | Y                | Y                   |
+| **Flink SQL**         | Y                | N                   |
+| **PrestoDB**          | Y                | N                   |
+| **Trino**             | Y                | N                   |
+| **AWS Athena**        | Y                | N                   |
+| **BigQuery**          | Y                | N                   |
+| **Impala**            | Y                | N                   |
+| **Redshift Spectrum** | Y                | N                   |
+| **Doris**             | Y                | N                   |
+| **StarRocks**         | Y                | Y                   |

Review Comment:
   Are incremental queries supported in StarRocks?
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423989899

## website/docs/sql_queries.md:

@@ -362,37 +349,37 @@ Following tables show whether a given query is supported on specific query engines

 ### Copy-On-Write tables

-| Query Engine |Snapshot Queries|Incremental Queries|
-|---|---|---|
-| **Hive** |Y|Y|
-| **Spark SQL** |Y|Y|
-| **Flink SQL** |Y|N|
-| **PrestoDB** |Y|N|
-| **Trino** |Y|N|
-| **AWS Athena**|Y|N|
-| **BigQuery** |Y|N|
-| **Impala**|Y|N|
-| **Redshift Spectrum** |Y|N|
-| **Doris** |Y|N|
-| **StarRocks** |Y|N|
-| **ClickHouse**|Y|N|
+| Query Engine          | Snapshot Queries | Incremental Queries |
+|-----------------------|------------------|---------------------|
+| **Hive**              | Y                | Y                   |
+| **Spark SQL**         | Y                | Y                   |
+| **Flink SQL**         | Y                | N                   |
+| **PrestoDB**          | Y                | N                   |
+| **Trino**             | Y                | N                   |
+| **AWS Athena**        | Y                | N                   |
+| **BigQuery**          | Y                | N                   |
+| **Impala**            | Y                | N                   |
+| **Redshift Spectrum** | Y                | N                   |
+| **Doris**             | Y                | N                   |
+| **StarRocks**         | Y                | Y                   |
+| **ClickHouse**        | Y                | N                   |

 ### Merge-On-Read tables

-| Query Engine|Snapshot Queries|Incremental Queries|Read Optimized Queries|
-|---|---|---|---|
-| **Hive**|Y|Y|Y|
-| **Spark SQL** |Y|Y|Y|
-| **Spark Datasource** |Y|Y|Y|
-| **Flink SQL** |Y|Y|Y|
-| **PrestoDB**|Y|N|Y|
-| **AWS Athena** |Y|N|Y|
-| **Big Query** |Y|N|Y|
-| **Trino** |N|N|Y|
-| **Impala** |N|N|Y|
-| **Redshift Spectrum** |N|N|N|
-| **Doris** |N|N|N|
-| **StarRocks** |N|N|N|
-| **ClickHouse** |N|N|N|
+| Query Engine          | Snapshot Queries | Incremental Queries | Read Optimized Queries |
+|-----------------------|------------------|---------------------|------------------------|
+| **Hive**              | Y                | Y                   | Y                      |
+| **Spark SQL**         | Y                | Y                   | Y                      |
+| **Spark Datasource**  | Y                | Y                   | Y                      |
+| **Flink SQL**         | Y                | Y                   | Y                      |
+| **PrestoDB**          | Y                | N                   | Y                      |
+| **AWS Athena**        | Y                | N                   | Y                      |
+| **Big Query**         | Y                | N                   | Y                      |
+| **Trino**             | N                | N                   | Y                      |
+| **Impala**            | N                | N                   | Y                      |
+| **Redshift Spectrum** | N                | N                   | Y                      |
+| **Doris**             | N                | N                   | N                      |
+| **StarRocks**         | Y                | Y                   | Y                      |

Review Comment:
   Incremental queries are not supported, correct?
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423987814

## website/docs/sql_queries.md:

@@ -146,15 +142,11 @@ There are 3 use cases for incremental query:
    the interval is a closed one: both start commit and end commit are inclusive;
 3. Time Travel: consume as batch for an instant time, specify the `read.end-commit` is enough
    because the start commit is latest by default.

-```sql

Review Comment:
   Please retain the code samples.
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423989370

## website/docs/sql_queries.md:

@@ -337,10 +326,8 @@ will be supported in the future.

 ## StarRocks

-Copy on Write tables in Apache Hudi 0.10.0 and above can be queried via StarRocks external tables from StarRocks version

Review Comment:
   The commit message suggests incremental queries are not supported. Can we state explicitly whether they are supported or not?
Re: [PR] [MINOR] [DOCS] changes to redshift & starrocks compat matrix [hudi]
bhasudha commented on code in PR #10294:
URL: https://github.com/apache/hudi/pull/10294#discussion_r1423987370

## website/docs/sql_queries.md:

@@ -98,44 +98,40 @@ Once the Flink Hudi tables have been registered to the Flink catalog, they can be
 relying on the custom Hudi input formats like Hive. Typically, notebook users and Flink SQL CLI users
 leverage flink sql for querying Hudi tables. Please add hudi-flink-bundle as described in the [Flink Quickstart](/docs/flink-quick-start-guide).

-### Snapshot Query
+### Snapshot Query

 By default, Flink SQL will try to use its optimized native readers (for e.g. reading parquet files)
 instead of Hive SerDes. Additionally, partition pruning is applied by Flink if a partition predicate is specified
 in the filter. Filters push down may not be supported yet (please check Flink roadmap).

-```sql
-select * from hudi_table/*+ OPTIONS('metadata.enabled'='true', 'read.data.skipping.enabled'='false','hoodie.metadata.index.column.stats.enable'='true')*/;

Review Comment:
   Please retain the code samples.

## website/docs/sql_queries.md:

@@ -98,44 +98,40 @@ Once the Flink Hudi tables have been registered to the Flink catalog, they can be
 relying on the custom Hudi input formats like Hive. Typically, notebook users and Flink SQL CLI users
 leverage flink sql for querying Hudi tables. Please add hudi-flink-bundle as described in the [Flink Quickstart](/docs/flink-quick-start-guide).

-### Snapshot Query
+### Snapshot Query

 By default, Flink SQL will try to use its optimized native readers (for e.g. reading parquet files)
 instead of Hive SerDes. Additionally, partition pruning is applied by Flink if a partition predicate is specified
 in the filter. Filters push down may not be supported yet (please check Flink roadmap).

-```sql
-select * from hudi_table/*+ OPTIONS('metadata.enabled'='true', 'read.data.skipping.enabled'='false','hoodie.metadata.index.column.stats.enable'='true')*/;
-```
-
 #### Options

-| Option Name | Required | Default | Remarks |
-| --- | --- | --- | --- |
-| `metadata.enabled` | `false` | false | Set to `true` to enable |
-| `read.data.skipping.enabled` | `false` | false | Whether to enable data skipping for batch snapshot read, by default disabled |
-| `hoodie.metadata.index.column.stats.enable` | `false` | false | Whether to enable column statistics (max/min) |
-| `hoodie.metadata.index.column.stats.column.list` | `false` | N/A | Columns(separated by comma) to collect the column statistics |
+| Option Name                                      | Required | Default | Remarks                                                                      |
+|--------------------------------------------------|----------|---------|------------------------------------------------------------------------------|
+| `metadata.enabled`                               | `false`  | false   | Set to `true` to enable                                                      |
+| `read.data.skipping.enabled`                     | `false`  | false   | Whether to enable data skipping for batch snapshot read, by default disabled |
+| `hoodie.metadata.index.column.stats.enable`      | `false`  | false   | Whether to enable column statistics (max/min)                                |
+| `hoodie.metadata.index.column.stats.column.list` | `false`  | N/A     | Columns(separated by comma) to collect the column statistics                 |

 ### Streaming Query

 By default, the hoodie table is read as batch, that is to read the latest snapshot data set and returns. Turns on the streaming read
 mode by setting option `read.streaming.enabled` as `true`. Sets up option `read.start-commit` to specify the read start offset, specifies the
 value as `earliest` if you want to consume all the history data set.

 ```sql
-select * from hudi_table/*+ OPTIONS('read.streaming.enabled'='true', 'read.start-commit'='earliest')*/;

Review Comment:
   Please retain the code samples.
Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]
ad1happy2go commented on issue #9977:
URL: https://github.com/apache/hudi/issues/9977#issuecomment-1852010551

Yes, they may be related. We missed backporting the fix to the 0.12.x minor releases. Does your original dataset also have more than 100 columns?
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
njalan commented on code in PR #10308:
URL: https://github.com/apache/hudi/pull/10308#discussion_r1423964727

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:

@@ -1092,6 +1092,10 @@ class HoodieSparkSqlWriterInternal {
       && mergedParams.getOrElse(DataSourceWriteOptions.TABLE_TYPE.key, COPY_ON_WRITE.name) == MERGE_ON_READ.name) {
       mergedParams.put(HoodieTableConfig.DROP_PARTITION_COLUMNS.key, "false")
     }
+    // use meta sync database to fill hoodie.table.name if it not sets
+    if (!mergedParams.contains(HoodieTableConfig.DATABASE_NAME.key()) && mergedParams.contains(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key())) {

Review Comment:
   @danny0405 Yes, I updated the comments just now.
Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]
hudi-bot commented on PR #10318:
URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851946870

## CI report:

* 32e63551638725305e5b3318816aa4a469399796 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21471)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313:
URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851861748

## CI report:

* 5273d8cc9ed428d2ac6896f52664618ed02c98a1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21468)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
voonhous commented on PR #10312:
URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851820623

@danny0405 @cuibo01 I read through the JIRA ticket. While I understand how that state of the TM and JM can cause the potential data loss, I am still not sure how the TM and JM reach that state in the first place. Can you please describe a Flink job I can use to try to replicate this? Thank you!
Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]
hudi-bot commented on PR #10318:
URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851799300

## CI report:

* 32e63551638725305e5b3318816aa4a469399796 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21471)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]
hudi-bot commented on PR #10318: URL: https://github.com/apache/hudi/pull/10318#issuecomment-1851786461 ## CI report: * 32e63551638725305e5b3318816aa4a469399796 UNKNOWN
Re: [I] [SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage [hudi]
danny0405 commented on issue #10315: URL: https://github.com/apache/hudi/issues/10315#issuecomment-1851775346 > but I want a config that can tell source that only reads the partition that in my configs so I do not need to use filter That does not follow the common intuition.
Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]
hudi-bot commented on PR #10317: URL: https://github.com/apache/hudi/pull/10317#issuecomment-1851773322 ## CI report: * d17847ad9ae0724c7e93fc3a8423ba069326541a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21469)
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
hudi-bot commented on PR #10313: URL: https://github.com/apache/hudi/pull/10313#issuecomment-1851773223 ## CI report: * b9ebe136bdcafc4d5bbd407691f2420ccab45adc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21466) * 5273d8cc9ed428d2ac6896f52664618ed02c98a1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21468)
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
danny0405 commented on code in PR #10308: URL: https://github.com/apache/hudi/pull/10308#discussion_r1423799754 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -1092,6 +1092,10 @@ class HoodieSparkSqlWriterInternal { && mergedParams.getOrElse(DataSourceWriteOptions.TABLE_TYPE.key, COPY_ON_WRITE.name) == MERGE_ON_READ.name) { mergedParams.put(HoodieTableConfig.DROP_PARTITION_COLUMNS.key, "false") } +// use meta sync database to fill hoodie.table.name if it not sets +if (!mergedParams.contains(HoodieTableConfig.DATABASE_NAME.key()) && mergedParams.contains(HoodieSyncConfig.META_SYNC_DATABASE_NAME.key())) { Review Comment: Are you saying `hoodie.database.name` ?
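The diff under review falls back to the meta-sync database name when `hoodie.database.name` is unset. A minimal stand-in sketch of that fallback logic, using a plain `Map` in place of Hudi's `mergedParams` and config classes (the `META_SYNC_DATABASE_NAME` key literal below is a placeholder for illustration, not necessarily the real `HoodieSyncConfig` key string):

```java
import java.util.HashMap;
import java.util.Map;

public class DatabaseNameFallback {
    static final String DATABASE_NAME = "hoodie.database.name";
    // Placeholder literal standing in for HoodieSyncConfig.META_SYNC_DATABASE_NAME.key()
    static final String META_SYNC_DATABASE_NAME = "meta.sync.database.name";

    // Only fall back when hoodie.database.name is unset but the
    // meta-sync database is configured, mirroring the diff's condition.
    static void fillDatabaseName(Map<String, String> mergedParams) {
        if (!mergedParams.containsKey(DATABASE_NAME)
                && mergedParams.containsKey(META_SYNC_DATABASE_NAME)) {
            mergedParams.put(DATABASE_NAME, mergedParams.get(META_SYNC_DATABASE_NAME));
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(META_SYNC_DATABASE_NAME, "analytics");
        fillDatabaseName(params);
        System.out.println(params.get(DATABASE_NAME)); // prints analytics
    }
}
```

An explicitly set `hoodie.database.name` is left untouched, which is why the guard checks `containsKey` before writing.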
[jira] [Closed] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7132. Fix Version/s: 0.14.1 Resolution: Fixed Fixed via master branch: 17b62a2c0f47f86b436330f2b0ea109b8c8f743c > Data may be lost in Flink checkpoint > > > Key: HUDI-7132 > URL: https://issues.apache.org/jira/browse/HUDI-7132 > Project: Apache Hudi > Issue Type: Bug > Components: flink >Affects Versions: 0.13.1, 0.14.0 >Reporter: Bo Cui >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L524C23-L524C35 > before the line code, eventBuffer may be updated by `subtaskFailed`, and some > elements of eventBuffer is null > https://github.com/apache/hudi/blob/a1afcdd989ce2d634290d1bd9e099a17057e6b4d/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java#L305C10-L305C21 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
danny0405 merged PR #10312: URL: https://github.com/apache/hudi/pull/10312
(hudi) branch master updated (cacbb82254c -> 17b62a2c0f4)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from cacbb82254c [HUDI-6658] Inject filters for incremental query (#10225) add 17b62a2c0f4 [HUDI-7132] Data may be lost for flink task failure (#10312) No new revisions were added by this update. Summary of changes: .../hudi/sink/StreamWriteOperatorCoordinator.java | 7 +++--- .../sink/TestStreamWriteOperatorCoordinator.java | 29 ++ 2 files changed, 32 insertions(+), 4 deletions(-)
[jira] [Updated] (HUDI-7131) The requested schema is not compatible with the file schema
[ https://issues.apache.org/jira/browse/HUDI-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7131: - Labels: core merge pull-request-available spark (was: core merge spark)

> The requested schema is not compatible with the file schema
> ---
>
> Key: HUDI-7131
> URL: https://issues.apache.org/jira/browse/HUDI-7131
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.14.0
> Reporter: loukey_j
> Priority: Critical
> Labels: core, merge, pull-request-available, spark
> Fix For: 0.14.1
>
> Using a global index with a record's partition change reports the error "The requested schema is not compatible with the file schema...". Why not use the schema from org.apache.hudi.common.table.TableSchemaResolver#getTableAvroSchemaInternal to read the Hudi data?
>
> CREATE TABLE if not exists unisql.hudi_ut_time_traval
> (id INT, version INT, name STRING, birthDate TIMESTAMP, inc_day STRING) USING HUDI
> PARTITIONED BY (inc_day) TBLPROPERTIES (type='cow', primaryKey='id');
>
> insert into unisql.hudi_ut_time_traval
> select 1 as id, 1 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-01' as date) as inc_day;
>
> select * from hudi_ut_time_traval; returns the single row:
> _hoodie_commit_time=20231122100234339, _hoodie_commit_seqno=20231122100234339_0_0, _hoodie_record_key=1, _hoodie_partition_path=inc_day=2023-10-01, _hoodie_file_name=8a510742-c060-4d12-898e-70bbd122f2e3-0_0-19-16_20231122100234339.parquet, id=1, version=1, name=str_1, birthDate=2023-01-01 12:12:12, inc_day=2023-10-01
>
> merge into hudi_ut_time_traval t using (
> select 1 as id, 2 as version, 'str_1' as name, cast('2023-01-01 12:12:12.0' as timestamp) as birthDate, cast('2023-10-02' as date) as inc_day
> ) s on t.id=s.id when matched THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required int32 id != optional int32 id
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:101)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
> at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
> at org.apache.parquet.schema.MessageType.accept(MessageType.java:55)
> at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:162)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:135)
> at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225);
>
> parquet schema:
> {
>   "type": "record",
>   "name": "hudi_ut_time_traval_record",
>   "namespace": "hoodie.hudi_ut_time_traval",
>   "fields": [
>     { "name": "_hoodie_commit_time", "type": ["null", "string"], "doc": "", "default": null },
>     { "name": "_hoodie_commit_seqno", "type": ["null", "string"], "doc": "", "default": null },
>     { "name": "_hoodie_record_key", "type": ["null", "string"], "doc": "", "default": null },
>     { "name": "_hoodie_partition_path", "type": ["null", "string"], "doc": "", "default": null },
>     { "name": "_hoodie_file_name", "type": ["null", "string"], "doc": "", "default": null },
>     { "name": "id", "type": ["null", "int"], "default": null },
>     { "name": "version", "type": ["null", "int"], "default": null },
>     { "name": "name", "type": ["null", "string"], "default": null },
>     { "name": "birthDate", "type": ["null", { "type": "long", "logicalType": "timestamp-micros" }], "default": null },
>     { "name": "inc_day", "type": ["null", "string"], "default": null }
>   ]
> }
>
> org.apache.hudi.io.HoodieMergedReadHandle#readerSchema: >
[PR] [HUDI-7131] Fixing schema used to read base file in HoodieMergedReadHandle [hudi]
nsivabalan opened a new pull request, #10318: URL: https://github.com/apache/hudi/pull/10318 ### Change Logs Fixing schema used to read base file in HoodieMergedReadHandle ### Impact MIT works for global index use-cases. ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]
hudi-bot commented on PR #10317: URL: https://github.com/apache/hudi/pull/10317#issuecomment-1851697428 ## CI report: * d17847ad9ae0724c7e93fc3a8423ba069326541a UNKNOWN
[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling
[ https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7225: - Labels: pull-request-available (was: ) > Correcting spelling errors or annotations with non-standard spelling > > > Key: HUDI-7225 > URL: https://issues.apache.org/jira/browse/HUDI-7225 > Project: Apache Hudi > Issue Type: Improvement >Reporter: mazhengxuan >Priority: Minor > Labels: pull-request-available > > Modify some spelling errors or non-standard spelling comments pointed out by > Typo
[PR] [HUDI-7225] Correcting spelling errors or annotations with non-standa… [hudi]
LeshracTheMalicious opened a new pull request, #10317: URL: https://github.com/apache/hudi/pull/10317 …rd spelling ### Change Logs Modify some spelling errors or non-standard spelling comments pointed out by Typo ### Impact Theoretically no impact ### Risk level (write none, low medium or high below) none
[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling
[ https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mazhengxuan updated HUDI-7225: -- Description: Modify some spelling errors or non-standard spelling comments pointed out by Typo (was: Revise some comments pointed out by Typo that are misspelled)
Re: [PR] [HUDI-7224] HoodieSparkSqlWriter metasync success or not show details messages log [hudi]
hudi-bot commented on PR #10314: URL: https://github.com/apache/hudi/pull/10314#issuecomment-1851635963 ## CI report: * 88b9f8d9518f5afd376479ba9c87a8dd30170ffc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21467)
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
hudi-bot commented on PR #10312: URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851635788 ## CI report: * 5c971e1a0cafb635ad9cfed0f452751314bdb21c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21465)
Re: [PR] Incoming batch schema is not compatible with the table's one #9980 [hudi]
hudi-bot commented on PR #10308: URL: https://github.com/apache/hudi/pull/10308#issuecomment-1851635617 ## CI report: * 737e09fc37912e88f640393b11357cb8b27a29c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21464)
[jira] [Updated] (HUDI-7225) Correcting spelling errors or annotations with non-standard spelling
[ https://issues.apache.org/jira/browse/HUDI-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mazhengxuan updated HUDI-7225: -- Summary: Correcting spelling errors or annotations with non-standard spelling (was: Correcting comments with incorrect spelling) > Correcting spelling errors or annotations with non-standard spelling > > > Key: HUDI-7225 > URL: https://issues.apache.org/jira/browse/HUDI-7225 > Project: Apache Hudi > Issue Type: Improvement >Reporter: mazhengxuan >Priority: Minor > > Revise some comments pointed out by Typo that are misspelled
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
prathit06 commented on code in PR #10313: URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java: ## @@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration conf, String field public static void addProjectionField(Configuration conf, String[] fieldName) { if (fieldName.length > 0) { - List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList()); + List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList()); Arrays.stream(fieldName).forEach(field -> { Review Comment: - It will be used when a columns list is passed in the Job Configuration - It won't be used in cases where the Configuration is created with empty params, such as `val jobConf = new JobConf()` (this is what we are doing currently in our Flink job to read a hoodie table); in that case `conf.get(serdeConstants.LIST_COLUMNS)` returns null and splitting it throws an NPE, so this particular fix handles such cases
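The behaviour described in the review can be reproduced without Hadoop at all: `get(key)` on an empty configuration returns null, and calling `.split(",")` on null throws. A self-contained sketch, with a `HashMap` standing in for the Hadoop `Configuration` and `getOrDefault` playing the role of `conf.get(key, "")`:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ProjectionNpeSketch {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // analogue of an empty JobConf

        // Old form: get(key) on an empty conf returns null, so split() throws.
        boolean npe = false;
        try {
            String raw = conf.get("columns");
            raw.split(",");
        } catch (NullPointerException e) {
            npe = true;
        }
        System.out.println(npe); // prints true

        // Fixed form: a "" default yields a harmless single empty element,
        // which downstream code can treat as "no columns configured".
        List<String> cols = Arrays.stream(conf.getOrDefault("columns", "").split(","))
                .collect(Collectors.toList());
        System.out.println(cols.size()); // prints 1
    }
}
```

Note that `"".split(",")` returns a one-element array containing the empty string, not an empty array, so callers still need to tolerate that sentinel value.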
Re: [I] [SUPPORT] how to config hudi table TTL in S3? The table_meta can be separated into a directory? [hudi]
zyclove commented on issue #10316: URL: https://github.com/apache/hudi/issues/10316#issuecomment-1851604695 > @zyclove Dont think if there is a way to point the different directory outside table directory OR having any such TTL configuration. Why can't we consider storing metadata and data files independently? The data TTL can be more flexible and convenient. Can it be mentioned and submitted in subsequent planning meetings? Thanks
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
prathit06 commented on code in PR #10313: URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java: ## @@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration conf, String field public static void addProjectionField(Configuration conf, String[] fieldName) { if (fieldName.length > 0) { - List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList()); + List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList()); Arrays.stream(fieldName).forEach(field -> { Review Comment: `LIST_COLUMNS` - will be used when a columns list is passed in the Job Configuration - won't be used in cases where the Configuration is created with empty params, such as `val jobConf = new JobConf()` (this is what we are doing currently in our Flink job to read a hoodie table); in that case `conf.get(serdeConstants.LIST_COLUMNS)` returns null and splitting it throws an NPE, so this particular fix handles such cases ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java: ## @@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration conf, String field public static void addProjectionField(Configuration conf, String[] fieldName) { if (fieldName.length > 0) { - List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList()); + List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList()); Arrays.stream(fieldName).forEach(field -> { Review Comment: `LIST_COLUMNS` - It will be used when a columns list is passed in the Job Configuration - It won't be used in cases where the Configuration is created with empty params, such as `val jobConf = new JobConf()` (this is what we are doing currently in our Flink job to read a hoodie table); in that case `conf.get(serdeConstants.LIST_COLUMNS)` returns null and splitting it throws an NPE, so this particular fix handles such cases
Re: [PR] [HUDI-7132] Data may be lost for flink task failure [hudi]
cuibo01 commented on PR #10312: URL: https://github.com/apache/hudi/pull/10312#issuecomment-1851568190 LGTM
[jira] [Assigned] (HUDI-7170) Implement HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Cui reassigned HUDI-7170: Assignee: Bo Cui (was: Ethan Guo) > Implement HFile reader independent of HBase > --- > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Bo Cui >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > We'd like to provide our own implementation of an HFile reader which does not use > HBase dependencies. In the long term, we should also decouple the HFile > reader from hadoop FileSystem abstractions.
Re: [PR] [MINOR] NPE fix while adding projection field & added its test cases [hudi]
prathit06 commented on code in PR #10313: URL: https://github.com/apache/hudi/pull/10313#discussion_r1423664817 ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeInputFormatUtils.java: ## @@ -86,7 +86,7 @@ private static Configuration addProjectionField(Configuration conf, String field public static void addProjectionField(Configuration conf, String[] fieldName) { if (fieldName.length > 0) { - List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList()); + List<String> columnNameList = Arrays.stream(conf.get(serdeConstants.LIST_COLUMNS, "").split(",")).collect(Collectors.toList()); Arrays.stream(fieldName).forEach(field -> { Review Comment: `LIST_COLUMNS` will be used when a columns list is passed in the Job Configuration, and it won't be used in cases where the Configuration is created with empty params, such as `val jobConf = new JobConf()` (this is what we are doing currently in our Flink job to read a hoodie table); in that case `conf.get(serdeConstants.LIST_COLUMNS)` returns null and splitting it throws an NPE, so this particular fix handles such cases
[jira] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132 ] Bo Cui deleted comment on HUDI-7132: -- was (Author: bo cui): From the code, this pr ([https://github.com/apache/hudi/pull/9867/files]) fixes the logic during initialization, but it doesn't fix the logic when a subtask fails, like in this logic. Is my understanding correct? !screenshot-1.png|width=750,height=379!
[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Cui updated HUDI-7132: - Attachment: (was: screenshot-1.png)
[jira] [Commented] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795645#comment-17795645 ] Bo Cui commented on HUDI-7132: -- From the code, this pr (https://github.com/apache/hudi/pull/9867/files) fixes the logic during initialization, but it doesn't fix the logic when a subtask fails, like in this logic. Is my understanding correct? !screenshot-1.png!
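The failure mode the ticket describes is that `subtaskFailed` can null out slots in the coordinator's `eventBuffer` before the commit path reads it. The defensive pattern implied by the fix, never letting null slots reach the commit, can be sketched in isolation (this is an illustration of the idea only, not the actual patch to `StreamWriteOperatorCoordinator`; the `WriteEvent` type is invented):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class EventBufferSketch {
    // Stand-in for the per-subtask write-metadata events the coordinator buffers.
    record WriteEvent(int subtask) {}

    // A failed subtask leaves its slot null, so only non-null events
    // should ever be handed to the commit logic.
    static List<WriteEvent> committable(WriteEvent[] eventBuffer) {
        return Arrays.stream(eventBuffer)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        WriteEvent[] buffer = {new WriteEvent(0), null, new WriteEvent(2)};
        System.out.println(committable(buffer).size()); // prints 2
    }
}
```

Whether committing a partial buffer is acceptable or should instead abort the instant is exactly the design question the ticket and PR #10312 debate; the sketch only shows the null-safety half.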
[jira] [Updated] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Cui updated HUDI-7132: - Attachment: screenshot-1.png
[jira] [Comment Edited] (HUDI-7132) Data may be lost in Flink checkpoint
[ https://issues.apache.org/jira/browse/HUDI-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795645#comment-17795645 ] Bo Cui edited comment on HUDI-7132 at 12/12/23 8:50 AM: From the code, this pr ([https://github.com/apache/hudi/pull/9867/files]) fixes the logic during initialization, but it doesn't fix the logic when a subtask fails, like in this logic. Is my understanding correct? !screenshot-1.png|width=750,height=379! was (Author: bo cui): From the code, this pr (https://github.com/apache/hudi/pull/9867/files) fixes the logic during initialization, but it doesn't fix the logic when a subtask fails, like in this logic. Is my understanding correct? !screenshot-1.png!
Re: [PR] [HUDI-6979][RFC-76] support event time based compaction strategy [hudi]
waitingF commented on code in PR #10266:
URL: https://github.com/apache/hudi/pull/10266#discussion_r1423651556


##########
rfc/rfc-76/rfc-76.md:
##########
@@ -0,0 +1,238 @@
+# RFC-76: support EventTimeBasedCompactionStrategy
+
+## Proposers
+
+- @waitingF
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: [HUDI-6979](https://issues.apache.org/jira/browse/HUDI-6979)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Currently, to achieve low ingestion latency, we can adopt the MergeOnRead (MOR) table, which supports appending
+log files and compacting them into base files later. When querying the snapshot table (RT table) of a MOR table,
+the query side has to merge base and log files on the fly to see all data, which is time-consuming and adds
+query latency. For low query latency, Hudi also provides the read-optimized table (RO table), which reads like COW.
+
+However, there is currently no compaction strategy based on event time, so there is no data-freshness guarantee
+for the RO table. When users need all data before a specified time, they have to query the RT table instead,
+with the expected high query latency.

Review Comment:
   sure, will do


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
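The freshness guarantee the RFC targets can be sketched with a toy model (Python; hypothetical names, not Hudi's actual `CompactionStrategy` API): given a target event-time watermark, schedule compaction for every file group whose uncompacted log files contain records at or before the watermark, so the RO view becomes complete up to that time.

```python
# Toy model (hypothetical, not Hudi's real CompactionStrategy API) of an
# event-time based compaction strategy: pick every file group whose pending
# log files still hold records at or before the target watermark, so the
# read-optimized (RO) view serves all data with event_time <= watermark.

from dataclasses import dataclass

@dataclass
class LogFile:
    path: str
    min_event_time: int  # smallest event time among records in this log file
    max_event_time: int  # largest event time among records in this log file

def select_for_compaction(file_groups, watermark):
    """Return ids of file groups that must be compacted so the RO table
    contains every record with event_time <= watermark."""
    selected = []
    for group_id, log_files in file_groups.items():
        # Any log file overlapping (-inf, watermark] holds data the RO
        # view would otherwise miss.
        if any(lf.min_event_time <= watermark for lf in log_files):
            selected.append(group_id)
    return sorted(selected)

groups = {
    "fg-1": [LogFile("l1", 100, 200)],   # entirely before the watermark
    "fg-2": [LogFile("l2", 250, 400)],   # entirely after -> can wait
    "fg-3": [LogFile("l3", 150, 300)],   # straddles -> must compact
}
print(select_for_compaction(groups, watermark=220))  # ['fg-1', 'fg-3']
```

In a real implementation the per-log-file event-time range would have to come from file metadata or column statistics; the point of the sketch is only the selection predicate.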
[jira] [Created] (HUDI-7225) Correcting comments with incorrect spelling
mazhengxuan created HUDI-7225:
---------------------------------

             Summary: Correcting comments with incorrect spelling
                 Key: HUDI-7225
                 URL: https://issues.apache.org/jira/browse/HUDI-7225
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: mazhengxuan


Fix some code comments flagged as misspelled by the Typo checker.
Re: [I] [SUPPORT] how to config hudi table TTL in S3? The table_meta can be separated into a directory? [hudi]
ad1happy2go commented on issue #10316:
URL: https://github.com/apache/hudi/issues/10316#issuecomment-1851532435

   @zyclove I don't think there is a way to point the table metadata to a different directory outside the table directory, or to configure any such TTL.