Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]
bhasudha commented on code in PR #10080: URL: https://github.com/apache/hudi/pull/10080#discussion_r1396826530 ## website/docs/indexing.md: ## @@ -155,4 +155,11 @@ to finally check the incoming updates against all files. The `SIMPLE` Index will `HBASE` index can be employed, if the operational overhead is acceptable and would provide much better lookup times for these tables. When using a global index, users should also consider setting `hoodie.bloom.index.update.partition.path=true` or `hoodie.simple.index.update.partition.path=true` to deal with cases where the -partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. These tables are also excellent candidates for the Merge-On-Read table type. \ No newline at end of file +partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. These tables are also excellent candidates for the Merge-On-Read table type. + + +## Related Resources +Videos + +* [Global Bloom Index: Remove duplicates & guarantee uniquness - Hudi Labs](https://youtu.be/XlRvMFJ7g9c) +* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY) Review Comment: The first one is okay. Let's move the second one over there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]
bhasudha commented on code in PR #10080: URL: https://github.com/apache/hudi/pull/10080#discussion_r1396826296 ## website/docs/indexing.md: ## @@ -155,4 +155,11 @@ to finally check the incoming updates against all files. The `SIMPLE` Index will `HBASE` index can be employed, if the operational overhead is acceptable and would provide much better lookup times for these tables. When using a global index, users should also consider setting `hoodie.bloom.index.update.partition.path=true` or `hoodie.simple.index.update.partition.path=true` to deal with cases where the -partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. These tables are also excellent candidates for the Merge-On-Read table type. \ No newline at end of file +partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. These tables are also excellent candidates for the Merge-On-Read table type. + + +## Related Resources +Videos + +* [Global Bloom Index: Remove duplicates & guarantee uniquness - Hudi Labs](https://youtu.be/XlRvMFJ7g9c) +* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY) Review Comment: This one is a good fit for the page - https://hudi.apache.org/docs/next/metadata_indexing ## website/docs/indexing.md: ## @@ -155,4 +155,11 @@ to finally check the incoming updates against all files. The `SIMPLE` Index will `HBASE` index can be employed, if the operational overhead is acceptable and would provide much better lookup times for these tables. When using a global index, users should also consider setting `hoodie.bloom.index.update.partition.path=true` or `hoodie.simple.index.update.partition.path=true` to deal with cases where the -partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. 
These tables are also excellent candidates for the Merge-On-Read table type. \ No newline at end of file +partition path value could change due to an update e.g users table partitioned by home city; user relocates to a different city. These tables are also excellent candidates for the Merge-On-Read table type. + + +## Related Resources +Videos + +* [Global Bloom Index: Remove duplicates & guarantee uniquness - Hudi Labs](https://youtu.be/XlRvMFJ7g9c) +* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY) Review Comment: Let's move it over there.
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
beyond1920 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396823348 ## .github/workflows/bot.yml: ## @@ -377,15 +370,6 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark3.1' -sparkRuntime: 'spark3.1.3' - - flinkProfile: 'flink1.13' Review Comment: Done
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
beyond1920 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396822953 ## .github/workflows/bot.yml: ## @@ -302,15 +301,9 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' Review Comment: Done ## .github/workflows/bot.yml: ## @@ -377,15 +370,6 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' Review Comment: Done ## .github/workflows/bot.yml: ## @@ -302,15 +301,9 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark3.1' -sparkRuntime: 'spark3.1.3' - flinkProfile: 'flink1.14' sparkProfile: 'spark3.0' sparkRuntime: 'spark3.0.2' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark2.4' Review Comment: Done
Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]
bhasudha commented on code in PR #10080: URL: https://github.com/apache/hudi/pull/10080#discussion_r1396823127 ## website/docs/key_generation.md: ## @@ -209,3 +209,8 @@ Partition path generated from key generator: "2020040118" Input field value: "20200401" Partition path generated from key generator: "04/01/2020" + +## Related Resources Review Comment: This video might be relevant for de-duping, but it does not address key generation. Let's remove it.
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815889971 ## CI report: * 47200120dd37ebaee77c583628bfddac1f564b3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20967) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[I] [SUPPORT]hudi insert is too slow & [hudi]
zyclove opened a new issue, #10131: URL: https://github.com/apache/hudi/issues/10131 **Describe the problem you faced** Spark SQL bulk insert is too slow; how can I tune performance? Following https://hudi.apache.org/docs/performance I changed many configs, but it did not help. **To Reproduce** Steps to reproduce the behavior: 1. spark-sql set hudi config set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt; set spark.sql.hive.filesourcePartitionFileCacheSize=524288000; set hoodie.metadata.table=false; set hoodie.sql.insert.mode=non-strict; set hoodie.sql.bulk.insert.enable=true; set hoodie.populate.meta.fields=false; set hoodie.parquet.compression.codec=snappy; set hoodie.bloom.index.prune.by.ranges=false; set hoodie.file.listing.parallelism=800; set hoodie.cleaner.parallelism=800; set hoodie.insert.shuffle.parallelism=800; set hoodie.upsert.shuffle.parallelism=800; set hoodie.delete.shuffle.parallelism=800; set hoodie.memory.compaction.max.size=4294967296; set hoodie.memory.merge.max.size=107374182400; 2.
sql ``` insert into bi_dw_real.dwd_smart_datapoint_report_rw_clear_rt select /*+ coalesce(${partitions}) */ md5(concat(coalesce(data_id,''),coalesce(dev_id,''),coalesce(gw_id,''),coalesce(product_id,''),coalesce(uid,''),coalesce(dp_code,''),coalesce(dp_id,''),if(dp_mode in ('ro','rw','wr'),dp_mode,'un'),coalesce(dp_name,''),coalesce(dp_time,''),coalesce(dp_type,''),coalesce(dp_value,''),coalesce(ct,''))) as id, _hoodie_record_key as uuid, data_id,dev_id,gw_id,product_id,uid, dp_code,dp_id,if(dp_mode in ('ro','rw','wr'),dp_mode,'un') as dp_mode ,dp_name,dp_time,dp_type,dp_value, ct as gmt_modified, case when length(ct)=10 then date_format(from_unixtime(ct),'MMddHH') when length(ct)=13 then date_format(from_unixtime(ct/1000),'MMddHH') else '1970010100' end as dt from hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt', 'latest_state', '${taskBeginTime}', '${next30minuteTime}') lateral view dataPointExplode(split(value,'\001')[0]) dps as ct, data_id, dev_id, gw_id, product_id, uid, dp_code, dp_id, gmtModified, dp_mode, dp_name, dp_time, dp_type, dp_value where _hoodie_commit_time >${taskBeginTime} and _hoodie_commit_time<=${next30minuteTime}; ``` 3. result table info tblproperties ( type = 'mor', primaryKey = 'id', preCombineField = 'gmt_modified', hoodie.combine.before.upsert='false', hoodie.bucket.index.num.buckets=128, hoodie.compact.inline='false', hoodie.common.spillable.diskmap.type='ROCKS_DB', hoodie.datasource.write.partitionpath.field='dt,dp_mode', hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload' ) **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version :0.14.0 * Spark version :3.2.1 * Hive version :3.2.1 * Hadoop version :3.2.2 * Storage (HDFS/S3/GCS..) :s3 * Running on Docker? 
(yes/no) :no **Additional context** ![image](https://github.com/apache/hudi/assets/15028279/96c73e5a-d1f2-4db0-b583-37605ca754d0) ![image](https://github.com/apache/hudi/assets/15028279/1ce817b5-b372-4c09-a14c-f5ebf83f32fb) How do I change the parallelism? I set `spark-sql --conf spark.default.parallelism=800`, but it does not work. The following configs in the SQL file also do not work as expected. ``` set hoodie.file.listing.parallelism=800; set hoodie.cleaner.parallelism=800; set hoodie.insert.shuffle.parallelism=800; set hoodie.upsert.shuffle.parallelism=800; set hoodie.delete.shuffle.parallelism=800; ``` The following issues are not about bulk insert: #8189 #2620 Please take a look and give me some optimization suggestions.
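One possible tuning direction (an editorial sketch, not part of the original issue): with `hoodie.sql.bulk.insert.enable=true` the write goes through the bulk-insert path, which is governed by its own parallelism and sort-mode configs rather than the insert/upsert/delete shuffle parallelism settings shown above. The config names below are from the Hudi configuration reference; the values are illustrative.

```sql
-- Sketch: configs that govern the bulk-insert write path specifically
set hoodie.bulkinsert.shuffle.parallelism=800;
-- Sorting dominates bulk-insert cost; NONE skips it entirely (assumes a
-- sorted file layout is not required). Other modes include GLOBAL_SORT
-- (default) and PARTITION_SORT.
set hoodie.bulkinsert.sort.mode=NONE;
```

Whether NONE is acceptable depends on the table's query patterns, since GLOBAL_SORT produces better-clustered files at the cost of a large shuffle.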
Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]
bhasudha commented on code in PR #10080: URL: https://github.com/apache/hudi/pull/10080#discussion_r1396818699 ## website/docs/concurrency_control.md: ## @@ -279,4 +279,10 @@ hoodie.cleaner.policy.failed.writes=EAGER ## Caveats If you are using the `WriteClient` API, please note that multiple writes to the table need to be initiated from 2 different instances of the write client. -It is **NOT** recommended to use the same instance of the write client to perform multi writing. \ No newline at end of file +It is **NOT** recommended to use the same instance of the write client to perform multi writing. + +## Related Resources +Videos + +* [Efficient Data Ingestion with Glue Concurrency and Hudi Data Lake](https://www.youtube.com/watch?v=G_Xd1haycqE) Review Comment: This might cover just Glue concurrency, not Hudi. We can remove the first one.
Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]
bhasudha commented on code in PR #10080: URL: https://github.com/apache/hudi/pull/10080#discussion_r1396818335 ## website/docs/compaction.md: ## @@ -231,3 +231,5 @@ Offline compaction needs to submit the Flink task on the command line. The progr | `--seq` | `LIFO` (Optional) | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. | | `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new compaction task in configured interval. | | `--min-compaction-interval-seconds` | `600(s)` (optional) | The checking interval for service mode, by default 10 minutes. | + Review Comment: Avoid whitespace changes
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on PR #9952: URL: https://github.com/apache/hudi/pull/9952#issuecomment-1815859946 Can we also add an example for Incremental queries under Flink in the sql_queries page, for completeness?
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396792988 ## website/docs/sql_dml.md: ## @@ -199,7 +199,52 @@ You can control the behavior of these operations using various configuration opt ## Flink -Flink SQL also provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. All these operations are already -showcased in the [Flink Quickstart](/docs/flink-quick-start-guide). +Flink SQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update and delete data from your Hudi tables. Let's explore them one by one. +### Insert Data +You can utilize the INSERT INTO statement to incorporate data into a Hudi table using Flink SQL. Here are a few illustrative examples: + +```sql +INSERT INTO +SELECT FROM ; +``` + +Examples: + +```sql +-- Insert into a Hudi table +INSERT INTO hudi_table SELECT 1, 'a1', 20; +``` + +If the `write.operation` is 'upsert,' the INSERT INTO statement will not only insert new records but also update existing rows with the same record key. + +### Update Data +With Flink SQL, you can use update command to update the hudi table. Here are a few illustrative examples: + +```sql +UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression] +``` + +```sql +UPDATE hudi_table SET price = price * 2, ts = WHERE id = 1; +``` + +:::note Key requirements +Update query only work with batch excution mode. +::: + +### Delete Data Review Comment: Change `Delete Data` -> `Delete From` under both Spark SQL and Flink sections to be consistent with naming.
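For reference, a minimal sketch of the consistent `Delete From` example being suggested (assuming the standard `DELETE FROM ... WHERE ...` statement form; in Flink SQL this applies under batch execution mode, mirroring the update caveat above):

```sql
-- Same statement shape under both Spark SQL and Flink SQL (batch mode):
-- removes all records whose key matches the predicate
DELETE FROM hudi_table WHERE id = 1;
```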
[jira] [Assigned] (HUDI-7117) Functional index creation not working when table is created using datasource writer
[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-7117: - Assignee: Sagar Sumit > Functional index creation not working when table is created using datasource > writer > --- > > Key: HUDI-7117 > URL: https://issues.apache.org/jira/browse/HUDI-7117 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Aditya Goenka >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 1.0.0 > > > Details and Reproducible code under Github Issue - > [https://github.com/apache/hudi/issues/10110] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
majian1998 commented on code in PR #10120: URL: https://github.com/apache/hudi/pull/10120#discussion_r1396791841 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hudi.command.procedures + +import org.apache.avro.generic.IndexedRecord +import org.apache.hudi.avro.model.{BooleanWrapper, BytesWrapper, DateWrapper, DecimalWrapper, DoubleWrapper, FloatWrapper, HoodieMetadataColumnStats, IntWrapper, LongWrapper, StringWrapper, TimeMicrosWrapper, TimestampMicrosWrapper} +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.data.HoodieData +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.Row +import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} + +import java.util +import java.util.function.Supplier + + +class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with ProcedureBuilder with Logging { + private val PARAMETERS = Array[ProcedureParameter]( +ProcedureParameter.required(0, "table", DataTypes.StringType), +ProcedureParameter.optional(1, "targetColumns", DataTypes.StringType) + ) + + private val OUTPUT_TYPE = new StructType(Array[StructField]( +StructField("file_name", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("column_name", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("min_value", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("max_value", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("null_num", DataTypes.LongType, nullable = true, Metadata.empty) + )) + + def parameters: Array[ProcedureParameter] = PARAMETERS + + def outputType: StructType = OUTPUT_TYPE + + override def call(args: ProcedureArgs): Seq[Row] = { +super.checkArgs(PARAMETERS, args) + +val table = getArgValueOrDefault(args, PARAMETERS(0)) +val targetColumns = getArgValueOrDefault(args, PARAMETERS(1)).getOrElse("").toString +val targetColumnsSeq = targetColumns.split(",").toSeq +val 
basePath = getBasePath(table) +val metadataConfig = HoodieMetadataConfig.newBuilder + .enable(true) + .build +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val schemaUtil = new TableSchemaResolver(metaClient) +var schema = AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema) +val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, metadataConfig, metaClient) +val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, false) + +val rows = new util.ArrayList[Row] +colStatsRecords.collectAsList() + .stream() + .forEach(c => { +rows.add(Row(c.getFileName, c.getColumnName, getColumnStatsValue(c.getMinValue), getColumnStatsValue(c.getMaxValue), c.getNullCount.longValue())) + }) +rows.stream().toArray().map(r => r.asInstanceOf[Row]).toList + } + + def getColumnStatsValue(stats_value: Any): String = { +stats_value match { + case _: IntWrapper | + _: BooleanWrapper | + _: BytesWrapper | Review Comment: The support for this case has already been implemented, and now these two types will display the string of the array.
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396791058 ## website/docs/sql_dml.md: ## @@ -199,7 +199,52 @@ You can control the behavior of these operations using various configuration opt ## Flink -Flink SQL also provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. All these operations are already -showcased in the [Flink Quickstart](/docs/flink-quick-start-guide). +Flink SQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update and delete data from your Hudi tables. Let's explore them one by one. +### Insert Data +You can utilize the INSERT INTO statement to incorporate data into a Hudi table using Flink SQL. Here are a few illustrative examples: + +```sql +INSERT INTO +SELECT FROM ; +``` + +Examples: + +```sql +-- Insert into a Hudi table +INSERT INTO hudi_table SELECT 1, 'a1', 20; +``` + +If the `write.operation` is 'upsert,' the INSERT INTO statement will not only insert new records but also update existing rows with the same record key. + +### Update Data Review Comment: Change `Update Data` to `Update` for consistency?
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396790547 ## website/docs/sql_dml.md: ## @@ -199,7 +199,52 @@ You can control the behavior of these operations using various configuration opt ## Flink -Flink SQL also provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. All these operations are already -showcased in the [Flink Quickstart](/docs/flink-quick-start-guide). +Flink SQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update and delete data from your Hudi tables. Let's explore them one by one. +### Insert Data +You can utilize the INSERT INTO statement to incorporate data into a Hudi table using Flink SQL. Here are a few illustrative examples: + +```sql +INSERT INTO +SELECT FROM ; +``` + +Examples: + +```sql +-- Insert into a Hudi table +INSERT INTO hudi_table SELECT 1, 'a1', 20; +``` + +If the `write.operation` is 'upsert,' the INSERT INTO statement will not only insert new records but also update existing rows with the same record key. Review Comment: Add an example of how to set this option?
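A sketch of the kind of example being requested (assuming the Flink Hudi connector's `write.operation` table option; the table name and schema are illustrative):

```sql
-- The write operation is configured per table in the WITH clause
CREATE TABLE hudi_table(
  id BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  price DOUBLE
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',
  'write.operation' = 'upsert'  -- with 'insert', rows with matching keys are not updated
);
```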
[jira] [Updated] (HUDI-7117) Functional index creation not working when table is created using datasource writer
[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7117: -- Epic Link: HUDI-512 > Functional index creation not working when table is created using datasource > writer > --- > > Key: HUDI-7117 > URL: https://issues.apache.org/jira/browse/HUDI-7117 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Aditya Goenka >Priority: Blocker > Fix For: 1.0.0 > > > Details and Reproducible code under Github Issue - > [https://github.com/apache/hudi/issues/10110]
Re: [I] [SUPPORT] RFC 63 Functional Index Hudi 0.1.0-beta [hudi]
ad1happy2go commented on issue #10110: URL: https://github.com/apache/hudi/issues/10110#issuecomment-1815852059 @soumilshah1995 @codope Created a JIRA to track this issue - https://issues.apache.org/jira/browse/HUDI-7117
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396788955 ## website/docs/sql_ddl.md: ## @@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog The following is an example of creating a Flink table. Read the [Flink Quick Start](/docs/flink-quick-start-guide) guide for more examples. ```sql -CREATE TABLE hudi_table2( +CREATE TABLE hudi_table( id int, name string, price double ) WITH ( 'connector' = 'hudi', -'path' = 's3://bucket-name/hudi/', +'path' = 'file:///tmp/hudi_table', 'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, default is COPY_ON_WRITE ); ``` +### Create partitioned table + +The following is an example of creating a Flink partitioned table. + +```sql +CREATE TABLE hudi_table( + id BIGINT, + name STRING, + dt STRING, + hh STRING +) +PARTITIONED BY (`dt`) +WITH ( +'connector' = 'hudi', +'path' = 'file:///tmp/hudi_table', +'table.type' = 'MERGE_ON_READ' +); +``` + +### Create table with record keys and ordering fields Review Comment: Also consider adding a section on how to add Hudi-specific configs to Flink, similar to the 'Setting Hudi configs' section under Spark SQL on this page.
[jira] [Created] (HUDI-7117) Functional index creation not working when table is created using datasource writer
Aditya Goenka created HUDI-7117: --- Summary: Functional index creation not working when table is created using datasource writer Key: HUDI-7117 URL: https://issues.apache.org/jira/browse/HUDI-7117 Project: Apache Hudi Issue Type: Bug Components: index Reporter: Aditya Goenka Fix For: 1.0.0 Details and Reproducible code under Github Issue - [https://github.com/apache/hudi/issues/10110]
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396786024 ## website/docs/sql_dml.md: ## @@ -199,7 +199,52 @@ You can control the behavior of these operations using various configuration opt ## Flink -Flink SQL also provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. All these operations are already -showcased in the [Flink Quickstart](/docs/flink-quick-start-guide). +Flink SQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update and delete data from your Hudi tables. Let's explore them one by one. +### Insert Data Review Comment: Rename to `Insert Into` for consistent naming with the Spark SQL section?
Re: [PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]
hudi-bot commented on PR #10130: URL: https://github.com/apache/hudi/pull/10130#issuecomment-1815846308 ## CI report: * 42ed5726bd8cbfef2bed148ed6186034b11fd9eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20970) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815846194 ## CI report: * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964) * 29b2f98a40bd73d7cde23f8cd845fec1c5c22dd1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20969) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396784871 ## website/docs/sql_ddl.md: ## @@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog The following is an example of creating a Flink table. Read the [Flink Quick Start](/docs/flink-quick-start-guide) guide for more examples. ```sql -CREATE TABLE hudi_table2( +CREATE TABLE hudi_table( id int, name string, price double ) WITH ( 'connector' = 'hudi', -'path' = 's3://bucket-name/hudi/', +'path' = 'file:///tmp/hudi_table', 'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, default is COPY_ON_WRITE ); ``` +### Create partitioned table + +The following is an example of creating a Flink partitioned table. + +```sql +CREATE TABLE hudi_table( + id BIGINT, + name STRING, + dt STRING, + hh STRING +) +PARTITIONED BY (`dt`) +WITH ( +'connector' = 'hudi', +'path' = 'file:///tmp/hudi_table', +'table.type' = 'MERGE_ON_READ' +); +``` + +### Create table with record keys and ordering fields + +The following is an example of creating a Flink table with record key and ordering field similarly to spark. + +```sql +CREATE TABLE hudi_table( + id BIGINT PRIMARY KEY NOT ENFORCED, + name STRING, + dt STRING, + hh STRING +) +PARTITIONED BY (`dt`) +WITH ( +'connector' = 'hudi', +'path' = 'file:///tmp/hudi_table', +'table.type' = 'MERGE_ON_READ', +'precombine.field' = 'hh' +); +``` + ### Alter Table Review Comment: Add syntax similar to the Spark SQL section ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396784495 ## website/docs/sql_ddl.md: ## @@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog The following is an example of creating a Flink table. Read the [Flink Quick Start](/docs/flink-quick-start-guide) guide for more examples. Review Comment: Can we follow a pattern similar to the Spark SQL section, with the syntax first, followed by the examples? I believe there can be one syntax block for the create table section, followed by partitioned and non-partitioned sections.
Re: [PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]
hudi-bot commented on PR #10130: URL: https://github.com/apache/hudi/pull/10130#issuecomment-1815839243 ## CI report: * 42ed5726bd8cbfef2bed148ed6186034b11fd9eb UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815839107 ## CI report: * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964) * 29b2f98a40bd73d7cde23f8cd845fec1c5c22dd1 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396777966 ## website/docs/flink-quick-start-guide.md: ## @@ -215,9 +215,9 @@ HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable) builder.sink(dataStream, false); // The second parameter indicating whether the input data stream is bounded env.execute("Api_Sink"); - -// Full Quickstart Example - https://gist.github.com/ad1happy2go/1716e2e8aef6dcfe620792d6e6d86d36 ``` +Refer Full Quickstart Example [here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamWriter.java) Review Comment: Can we change the table's schema to match the Flink SQL example?
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396777280 ## website/docs/flink-quick-start-guide.md: ## @@ -322,6 +359,20 @@ The `DELETE` statement is supported since Flink 1.17, so only Hudi Flink bundle Only **batch** queries on Hudi table with primary key work correctly. ::: + + + + +Creates a Flink Hudi table first and insert data into the Hudi table using DataStream API as below. +When new rows with the same primary key and Row Kind as Delete arrive in stream, then it will be deleted. + +Refer Delete Example [here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamReader.java) Review Comment: Are the examples complete yet?
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396776833 ## website/docs/flink-quick-start-guide.md: ## @@ -302,7 +331,15 @@ Only **batch** queries on Hudi table with primary key work correctly. ::: ## Delete Data {#deletes} + + ### Row-level Delete Review Comment: Markdown Format nit: Leave space above and below to render heading style
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396776055 ## website/docs/flink-quick-start-guide.md: ## @@ -287,10 +289,37 @@ Refers to [Table types and queries](/docs/concepts#table-types--queries) for mor This is similar to inserting new data. + + + + +Creates a Flink Hudi table first and insert data into the Hudi table using SQL `VALUES` as below. + ```sql +-- Update Queries only works with batch execution mode SET 'execution.runtime-mode' = 'batch'; UPDATE hudi_table SET fare = 25.0 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330'; ``` + + + + +Creates a Flink Hudi table first and insert data into the Hudi table using DataStream API as below. +When new rows with the same primary key arrive in stream, then it will be updated. +In the insert example incoming row with same record id will be updated. + +Refer Update Example [here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamReader.java) + + + + Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time. Review Comment: It's not clear where the save mode is. Is this carried over from Spark quickstart by any chance?
Re: [PR] Flink quickstart and sql website updates [hudi]
bhasudha commented on code in PR #9952: URL: https://github.com/apache/hudi/pull/9952#discussion_r1396770283 ## website/docs/flink-quick-start-guide.md: ## @@ -287,10 +289,37 @@ Refers to [Table types and queries](/docs/concepts#table-types--queries) for mor This is similar to inserting new data. + + + + +Creates a Flink Hudi table first and insert data into the Hudi table using SQL `VALUES` as below. + ```sql +-- Update Queries only works with batch execution mode SET 'execution.runtime-mode' = 'batch'; UPDATE hudi_table SET fare = 25.0 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330'; ``` + + + + +Creates a Flink Hudi table first and insert data into the Hudi table using DataStream API as below. Review Comment: Instead of below -> link to the insert section of this doc?
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
danny0405 commented on PR #10126: URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815827193 Thanks for the help!
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
danny0405 commented on code in PR #10101: URL: https://github.com/apache/hudi/pull/10101#discussion_r1396767580 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java: ## @@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext context, boolean acquireLoc }; this.timelineWriter.write(instantsToArchive, Option.of(action -> deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler)); LOG.info("Deleting archived instants " + instantsToArchive); -success = deleteArchivedInstants(instantsToArchive, context); +deleteArchivedInstants(instantsToArchive, context); // triggers compaction and cleaning only after archiving action this.timelineWriter.compactAndClean(context); } else { LOG.info("No Instants to archive"); } - return success; + return instantsToArchive.size(); Review Comment: We can address it in another PR; we do not really have atomicity for the current 2 steps: - flush of the archived timeline - deletion of active metadata files My rough thought is that we can fix the leftover active metadata files (which should have been deleted from the active timeline) in the next round of archiving: imagine the latest instant time in the archived timeline is t10 and the oldest instant in the active timeline is t7; we should retry the deletion of instant metadata files from t7 ~ t10 at the very beginning.
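The retry idea described in the comment above can be illustrated with a toy, self-contained simulation. Class and method names here are hypothetical, not Hudi's actual archiver API: archiving flushes instants to the archived timeline first, then deletes the active-timeline metadata files; if the deletion dies part-way, the next round first retries deleting every instant already covered by the archived timeline.

```java
import java.util.List;
import java.util.TreeSet;

public class ArchiveRetrySketch {
    // Instant metadata files still present on the active timeline.
    final TreeSet<Integer> activeInstantFiles = new TreeSet<>();
    int latestArchivedInstant = -1; // -1 means nothing archived yet

    void commit(int instant) {
        activeInstantFiles.add(instant);
    }

    // Flush instants to the archived timeline, then delete the active
    // files; simulate a crash after `surviveDeletes` deletions.
    void archiveRound(List<Integer> instants, int surviveDeletes) {
        for (int t : instants) {
            latestArchivedInstant = Math.max(latestArchivedInstant, t);
        }
        int deleted = 0;
        for (int t : instants) {
            if (deleted == surviveDeletes) {
                return; // simulated crash: leftover files remain active
            }
            activeInstantFiles.remove(t);
            deleted++;
        }
    }

    // At the start of the next round, retry deletion of any leftover
    // instants already covered by the archived timeline.
    void retryLeftoverDeletion() {
        activeInstantFiles.removeIf(t -> t <= latestArchivedInstant);
    }

    public static void main(String[] args) {
        ArchiveRetrySketch timeline = new ArchiveRetrySketch();
        for (int t = 1; t <= 10; t++) {
            timeline.commit(t);
        }
        // Archive t1..t10, but crash after deleting only t1..t6.
        timeline.archiveRound(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 6);
        System.out.println("leftover after crash: " + timeline.activeInstantFiles);
        timeline.retryLeftoverDeletion();
        System.out.println("leftover after retry: " + timeline.activeInstantFiles);
    }
}
```

The sketch only shows the bookkeeping: after the simulated crash, t7..t10 remain as leftovers, and the retry pass at the beginning of the next round removes everything up to the latest archived instant.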
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
yihua commented on PR #10126: URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815823664 The upload is done.
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
danny0405 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396761804 ## .github/workflows/bot.yml: ## @@ -377,15 +370,6 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark3.1' -sparkRuntime: 'spark3.1.3' - - flinkProfile: 'flink1.13' Review Comment: We have a flink1.14.6 and spark 2.4.8 docker image now; we can just upgrade to flink1.14.
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
danny0405 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396761419 ## .github/workflows/bot.yml: ## @@ -302,15 +301,9 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark3.1' -sparkRuntime: 'spark3.1.3' - flinkProfile: 'flink1.14' sparkProfile: 'spark3.0' sparkRuntime: 'spark3.0.2' - - flinkProfile: 'flink1.13' -sparkProfile: 'spark2.4' Review Comment: We have a flink1.14.6 and spark 2.4.8 docker image now; we can just upgrade to flink1.14.
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
danny0405 merged PR #10126: URL: https://github.com/apache/hudi/pull/10126
[PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]
beyond1920 opened a new pull request, #10130: URL: https://github.com/apache/hudi/pull/10130 ### Change Logs After upgrading to version 0.14.0, the performance of Spark jobs writing into a simple bucket index table regressed. ![image](https://github.com/apache/hudi/assets/1525333/1652062b-3a96-4ed1-943a-9040ea7737ce) The reason is that in [PR#4480](https://github.com/apache/hudi/pull/4480), the refactoring of the bucket index introduced two unnecessary stages in the tagging step for the simple bucket index. `List partitions = records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();` ### Impact NA ### Documentation Update NA ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Updated] (HUDI-7111) Performance regression of spark job which written into simple bucket index table
[ https://issues.apache.org/jira/browse/HUDI-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7111: - Labels: pull-request-available (was: ) > Performance regression of spark job which written into simple bucket index > table > > > Key: HUDI-7111 > URL: https://issues.apache.org/jira/browse/HUDI-7111 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Jing Zhang >Priority: Major > Labels: pull-request-available > Attachments: image-2023-11-16-23-41-32-729.png > > > After upgrade the version to 0.14.0, the performance of the Spark job, which > is written into a simple bucket index table, is regressing. > !image-2023-11-16-23-41-32-729.png! > The reason is in the [PR#4480|https://github.com/apache/hudi/pull/4480], the > refactor of bucket index introduce two unnecessary stages in tag for simple > bucket index. > {code:java} > List partitions = > records.map(HoodieRecord::getPartitionPath).distinct().collectAsList(); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
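The regression above comes from tagging triggering a cluster-wide `distinct()`/`collectAsList()` of partition paths. As a self-contained illustration of why a simple bucket index should not need that, here is a hypothetical sketch (not Hudi's actual hash function or API): the target bucket is a pure function of the record key and the configured bucket count, so each record can be routed locally with no extra Spark stage.

```java
import java.util.List;

public class BucketIdSketch {
    // Illustrative only: non-negative hash of the record key, modulo the
    // configured number of buckets. Hudi's real implementation differs.
    static int bucketId(String recordKey, int numBuckets) {
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        for (String key : List.of("uuid-1", "uuid-2", "uuid-3")) {
            // Each record is routed from its own key; no distinct()/collect()
            // over all partition paths is required.
            System.out.println(key + " -> bucket " + bucketId(key, numBuckets));
        }
    }
}
```

Because the mapping is deterministic per key, tagging can stay a narrow per-record transformation instead of adding shuffle stages.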
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
yihua commented on PR #10126: URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815805457 The docker image `apachehudi/hudi-ci-bundle-validation-base:flink1146hive239spark248` is being uploaded.
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
yihua commented on code in PR #10125: URL: https://github.com/apache/hudi/pull/10125#discussion_r1396743948 ## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java: ## @@ -51,34 +51,41 @@ import java.util.Map; import java.util.stream.Collectors; +import static org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID; import static org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_DATASET_LOCATION; import static org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_DATASET_NAME; import static org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_PROJECT_ID; +import static org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER; public class HoodieBigQuerySyncClient extends HoodieSyncClient { private static final Logger LOG = LoggerFactory.getLogger(HoodieBigQuerySyncClient.class); protected final BigQuerySyncConfig config; private final String projectId; + private final String bigLakeConnectionId; private final String datasetName; + private final boolean requirePartitionFilter; private transient BigQuery bigquery; public HoodieBigQuerySyncClient(final BigQuerySyncConfig config) { super(config); this.config = config; this.projectId = config.getString(BIGQUERY_SYNC_PROJECT_ID); +this.bigLakeConnectionId = config.getString(BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID); this.datasetName = config.getString(BIGQUERY_SYNC_DATASET_NAME); +this.requirePartitionFilter = config.getBoolean(BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER); this.createBigQueryConnection(); } - @VisibleForTesting Review Comment: not sure if this needs to be kept. Leave it to you to decide. ## hudi-gcp/pom.xml: ## @@ -70,7 +70,6 @@ See https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google com.google.cloud google-cloud-pubsub - ${google.cloud.pubsub.version} Review Comment: Got it
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
yihua commented on code in PR #10125: URL: https://github.com/apache/hudi/pull/10125#discussion_r1396743251 ## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java: ## @@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable .markAdvanced() .withDocumentation("Fetch file listing from Hudi's metadata"); + public static final ConfigProperty BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty + .key("hoodie.gcp.bigquery.sync.require_partition_filter") + .defaultValue(false) + .withDocumentation("If true, configure table to require a partition filter to be specified when querying the table"); + + public static final ConfigProperty BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty + .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id") + .noDefaultValue() + .withDocumentation("The Big Lake connection ID to use"); + Review Comment: yeah
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815793435 ## CI report: * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722) * 47200120dd37ebaee77c583628bfddac1f564b3b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20967) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815788365 ## CI report: * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20965) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]
hudi-bot commented on PR #10002: URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815788149 ## CI report: * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722) * 47200120dd37ebaee77c583628bfddac1f564b3b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
hudi-bot commented on PR #10126: URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815754278 ## CI report: * 35fa17181c858c202869a4f9f7807ccb37c83438 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20966) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[PR] DOCS-updated community sync page for new details [hudi]
nfarah86 opened a new pull request, #10129: URL: https://github.com/apache/hudi/pull/10129 @bhasudha updated the community sync page to point to the LinkedIn live events page https://github.com/apache/hudi/assets/5392555/0eb42563-38f1-4fe4-98a8-a83d0ec2d2ab
Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
hudi-bot commented on PR #10126: URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815749391 ## CI report: * 35fa17181c858c202869a4f9f7807ccb37c83438 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[I] [SUPPORT] Hudi Sink Connector is not working with Version 0.14.0 and 0.13.1 [hudi]
seethb opened a new issue, #10128: URL: https://github.com/apache/hudi/issues/10128 Hi, I have followed the Hudi kafka connect instructions from this document https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md and am trying to set up a Hudi Sink connector in my local environment. After submitting the connector config (shown below), I get the error below. curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "hudi-kafka-test", "config": { "bootstrap.servers": "172.18.0.5:9092", "connector.class": "org.apache.hudi.connect.HoodieSinkConnector", "tasks.max": "1", "topics": "hudi-test-topic", "hoodie.table.name": "hudi-test-topic", "key.converter": "org.apache.kafka.connect.storage.StringConverter", "value.converter": "org.apache.kafka.connect.storage.StringConverter", "value.converter.schemas.enable": "false", "hoodie.table.type": "MERGE_ON_READ", "hoodie.base.path": "file:///tmp/hoodie/hudi-test-topic", "hoodie.datasource.write.recordkey.field": "volume", "hoodie.datasource.write.partitionpath.field": "date", "hoodie.schemaprovider.class": "org.apache.hudi.schema.SchemaRegistryProvider", "hoodie.deltastreamer.schemaprovider.registry.url": "http://172.18.0.7:8081/subjects/hudi-test-topic-value/versions/latest", "hoodie.kafka.commit.interval.secs": 60 } }' SubscriptionState:399) [2023-11-15 11:05:12,065] INFO [hudi-kafka-test|task-0] New partitions added [hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:150) [2023-11-15 11:05:12,065] INFO [hudi-kafka-test|task-0] Bootstrap task for connector hudi-kafka-test with id null with assignments [hudi-test-topic-0] part [hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:185) [2023-11-15 11:05:12,067] INFO [hudi-kafka-test|task-0] Existing partitions deleted [hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:156) [2023-11-15 11:05:12,067] ERROR [hudi-kafka-test|task-0] WorkerSinkTask{id=hudi-kafka-test-0} Task threw an uncaught
and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:195) java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream at org.apache.hudi.config.metrics.HoodieMetricsConfig.lambda$static$0(HoodieMetricsConfig.java:72) at org.apache.hudi.common.config.HoodieConfig.setDefaultValue(HoodieConfig.java:86) at org.apache.hudi.common.config.HoodieConfig.lambda$setDefaults$2(HoodieConfig.java:139) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) at org.apache.hudi.common.config.HoodieConfig.setDefaults(HoodieConfig.java:135) at org.apache.hudi.config.metrics.HoodieMetricsConfig.access$100(HoodieMetricsConfig.java:44) at org.apache.hudi.config.metrics.HoodieMetricsConfig$Builder.build(HoodieMetricsConfig.java:185) at org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:2937) at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3052) at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3047) at org.apache.hudi.connect.writers.KafkaConnectTransactionServices.(KafkaConnectTransactionServices.java:79) at 
org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.(ConnectTransactionCoordinator.java:88) at org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191) at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:637) at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1200(WorkerSinkTask.java:72) at
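[Editor's note] A `NoClassDefFoundError` for `org/apache/hadoop/fs/FSDataInputStream` usually means the Hadoop client jars are not on the Connect worker's classpath. A minimal sketch of one common workaround, assuming a local Hadoop install at `/opt/hadoop` (the path is hypothetical — adjust to your environment):

```
# Hypothetical paths -- this only illustrates putting the Hadoop client jars
# on the Connect worker's classpath before starting it.
export HADOOP_HOME=/opt/hadoop
export CLASSPATH="$("$HADOOP_HOME"/bin/hadoop classpath --glob)"
# then restart the Connect worker so HoodieSinkTask can load org.apache.hadoop.fs.*
```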
[I] [SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table [hudi]
shubhamn21 opened a new issue, #10127: URL: https://github.com/apache/hudi/issues/10127 **Describe the problem you faced** The hudi job runs fine for an hour but then crashes after a Warning about `Clean Action failure` and subsequently raising an exception `org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz-table-name/.hoodie/metadata is a valid table ` **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : * Spark version : 3.3.0 * Java version : 1.8 * Storage (HDFS/S3/GCS..) : S3A bucket * Running on Docker? (yes/no) : Yes, Spark on kubernetes **Stacktrace** ```23/11/17 04:00:55 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata 23/11/17 04:00:55 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231117040045263__clean__REQUESTED] org.apache.hudi.exception.HoodieIOException: Could not check if s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata is a valid table at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59) at org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:137) at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689) at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81) at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770) at org.apache.hudi.common.table.HoodieTableMetaClient.reload(HoodieTableMetaClient.java:174) at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:153) at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:838) at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:918) at org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68) at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68) at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221) at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187) at org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256) at java.util.ArrayList.forEach(ArrayList.java:1259) at org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263) at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557) at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:759) at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:731) at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.InterruptedIOException: getFileStatus on s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata/.hoodie: com.amazonaws.AbortedException: at org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395) at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201) at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175) at 
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799) at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688) at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444) at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337) at
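[Editor's note] The `com.amazonaws.AbortedException` wrapped in an `InterruptedIOException` suggests the async cleaner thread was interrupted (for example, during driver shutdown) in the middle of an S3 call. One hedged way to take the async cleaner out of the picture while debugging — the option key is a real Hudi writer config, but whether it fits your pipeline is workload-specific:

```
# Run cleaning inline with the writer instead of on the async cleaner thread,
# so a driver shutdown cannot interrupt an in-flight clean:
hoodie.clean.async=false
```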
[jira] [Updated] (HUDI-7116) Add docker image for flink 1.14 and spark 2.4.8
[ https://issues.apache.org/jira/browse/HUDI-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7116: - Labels: pull-request-available (was: ) > Add docker image for flink 1.14 and spark 2.4.8 > --- > > Key: HUDI-7116 > URL: https://issues.apache.org/jira/browse/HUDI-7116 > Project: Apache Hudi > Issue Type: Improvement > Components: compile >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]
danny0405 opened a new pull request, #10126: URL: https://github.com/apache/hudi/pull/10126 ### Change Logs Add the build image for flink 1.14 and spark 2.4.8. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7116) Add docker image for flink 1.14 and spark 2.4.8
Danny Chen created HUDI-7116: Summary: Add docker image for flink 1.14 and spark 2.4.8 Key: HUDI-7116 URL: https://issues.apache.org/jira/browse/HUDI-7116 Project: Apache Hudi Issue Type: Improvement Components: compile Reporter: Danny Chen Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
danny0405 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396685712 ## .github/workflows/bot.yml: ## @@ -377,15 +370,6 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' Review Comment: We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade to flink1.14. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]
danny0405 commented on code in PR #10052: URL: https://github.com/apache/hudi/pull/10052#discussion_r1396685357 ## .github/workflows/bot.yml: ## @@ -302,15 +301,9 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' Review Comment: We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade the flink profile to flink1.14. ## .github/workflows/bot.yml: ## @@ -302,15 +301,9 @@ jobs: - flinkProfile: 'flink1.14' sparkProfile: 'spark3.2' sparkRuntime: 'spark3.2.3' - - flinkProfile: 'flink1.13' Review Comment: We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade to flink1.14. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815718739 ## CI report: * 1877ef99bb4c939181b5341e19c44cbba742d7cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20957) * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20965) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]
danny0405 commented on issue #10107: URL: https://github.com/apache/hudi/issues/10107#issuecomment-1815714178 We did have some fixes in recent releases: https://github.com/apache/hudi/pull/7568, https://github.com/apache/hudi/pull/8443. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
hudi-bot commented on PR #10122: URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815713303 ## CI report: * 1877ef99bb4c939181b5341e19c44cbba742d7cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20957) * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Build failed using master [hudi]
danny0405 commented on PR #9726: URL: https://github.com/apache/hudi/pull/9726#issuecomment-1815711729 If we are talking about the version conflicts between the explicitly introduced Jackson jar and the Parquet jar, it is actually a bug and should have a fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]
danny0405 commented on code in PR #10122: URL: https://github.com/apache/hudi/pull/10122#discussion_r1396674754 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineServerHelper.java: ## @@ -23,66 +23,42 @@ import org.apache.hudi.common.util.Option; import org.apache.hudi.config.HoodieWriteConfig; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - import java.io.IOException; /** * Helper class to instantiate embedded timeline service. */ public class EmbeddedTimelineServerHelper { - private static final Logger LOG = LoggerFactory.getLogger(EmbeddedTimelineService.class); - - private static Option TIMELINE_SERVER = Option.empty(); - /** * Instantiate Embedded Timeline Server. * @param context Hoodie Engine Context * @param config Hoodie Write Config * @return TimelineServer if configured to run * @throws IOException */ - public static synchronized Option createEmbeddedTimelineService( + public static Option createEmbeddedTimelineService( HoodieEngineContext context, HoodieWriteConfig config) throws IOException { -if (config.isEmbeddedTimelineServerReuseEnabled()) { - if (!TIMELINE_SERVER.isPresent() || !TIMELINE_SERVER.get().canReuseFor(config.getBasePath())) { -TIMELINE_SERVER = Option.of(startTimelineService(context, config)); - } else { -updateWriteConfigWithTimelineServer(TIMELINE_SERVER.get(), config); - } - return TIMELINE_SERVER; -} if (config.isEmbeddedTimelineServerEnabled()) { - return Option.of(startTimelineService(context, config)); + Option hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST); + EmbeddedTimelineService timelineService = EmbeddedTimelineService.getOrStartEmbeddedTimelineService(context, hostAddr.orElse(null), config); + updateWriteConfigWithTimelineServer(timelineService, config); Review Comment: So a single Spark Driver has chance to be shared by multiple writers? -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815708138 ## CI report: * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
majian1998 commented on code in PR #10101: URL: https://github.com/apache/hudi/pull/10101#discussion_r1396670883 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java: ## @@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext context, boolean acquireLoc }; this.timelineWriter.write(instantsToArchive, Option.of(action -> deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler)); LOG.info("Deleting archived instants " + instantsToArchive); -success = deleteArchivedInstants(instantsToArchive, context); +deleteArchivedInstants(instantsToArchive, context); // triggers compaction and cleaning only after archiving action this.timelineWriter.compactAndClean(context); } else { LOG.info("No Instants to archive"); } - return success; + return instantsToArchive.size(); Review Comment: In the current archive code, it seems that `success` is always set to true. The `success` variable is initialized as true, and `deleteArchivedInstants` always returns true unless it fails. However, if there is a failure, the current implementation should throw an exception and terminate without returning any value. Therefore, I believe the `success` variable is meaningless in the current context. Alternatively, should we catch these exceptions and return false? I'm not sure if this would be reasonable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
danny0405 commented on code in PR #10101: URL: https://github.com/apache/hudi/pull/10101#discussion_r1396663741 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java: ## @@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext context, boolean acquireLoc }; this.timelineWriter.write(instantsToArchive, Option.of(action -> deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler)); LOG.info("Deleting archived instants " + instantsToArchive); -success = deleteArchivedInstants(instantsToArchive, context); +deleteArchivedInstants(instantsToArchive, context); // triggers compaction and cleaning only after archiving action this.timelineWriter.compactAndClean(context); } else { LOG.info("No Instants to archive"); } - return success; + return instantsToArchive.size(); Review Comment: Should we also include the `success` flag also as a metric? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815676361 ## CI report: * 178ef4eadac6ab6d009d86ab86d35babe952 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942) * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]
haoxie-aws commented on issue #10107: URL: https://github.com/apache/hudi/issues/10107#issuecomment-1815675385 I find this query, `select * from samplehudi as t1, samplehudi as t2, samplehudi as t3 where t1.key=t2.key and t1.key=t3.key limit 1`, is slightly easier to replicate the issue. And I have managed to replicate it with a Hudi 0.12.2 writer as recommended by @ad1happy2go . I know Hudi uses `CLEAN_RETAIN_COMMITS` to prevent parquet files being cleaned while they are used by query, but is there a similar mechanism for timeline files? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
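[Editor's note] On the question above: the active timeline is trimmed by the *archival* configs rather than by `CLEAN_RETAIN_COMMITS`. A sketch of the relevant writer options — the option keys are real Hudi configs, but the values are only illustrative and should be tuned to how long queries can run:

```
# Keep more instants (including replacecommits) on the active timeline:
hoodie.keep.min.commits=30
hoodie.keep.max.commits=40
# Cleaner retention must stay below hoodie.keep.min.commits:
hoodie.cleaner.commits.retained=20
```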
Re: [PR] [MINOR] Build failed using master [hudi]
Forus0322 commented on PR #9726: URL: https://github.com/apache/hudi/pull/9726#issuecomment-1815673222 @codope Hi, I reanalyzed the problem. The essence is that parquet 1.10.1 contains the org.codehaus.jackson dependency, while parquet 1.12.0 does not. Secondly, the JsonEncoder class under the hudi-common-base package uses org.codehaus.jackson, and hudi-common-base references the parquet-avro dependency, so compiling with parquet.version 1.10.1 works fine; but when using parquet.version 1.12.0, the JsonEncoder dependency is missing. Therefore, this issue does not need to be included in the current PR; just specify parquet.version 1.10.1. But if you want to use parquet.version 1.12.0, you need to add an explicit org.codehaus.jackson dependency. The current PR could modify the profiles so that not just hadoop3.3 but any profile using version 1.12.0 adds the org.codehaus.jackson dependency. Do you think it is necessary to modify it again? ![image](https://github.com/apache/hudi/assets/70357858/c884d3fa-f1dc-401e-bd8d-5a8d2cd30698) ![image](https://github.com/apache/hudi/assets/70357858/c83666a6-0be5-45e0-a9cf-b034338791ef) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
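[Editor's note] A quick way to verify which profile pulls in org.codehaus.jackson transitively is Maven's dependency tree; the module and property names below follow the discussion and may differ in your checkout:

```
# Inspect where org.codehaus.jackson comes from under each parquet version
# (module/property names assumed from the discussion above):
mvn dependency:tree -pl hudi-common -am \
    -Dincludes=org.codehaus.jackson -Dparquet.version=1.12.0
```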
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
stream2000 commented on code in PR #10101: URL: https://github.com/apache/hudi/pull/10101#discussion_r1396632832 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java: ## @@ -256,8 +256,8 @@ public void testArchiveEmptyTable() throws Exception { metaClient = HoodieTableMetaClient.reload(metaClient); HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient); HoodieTimelineArchiver archiver = new HoodieTimelineArchiver(cfg, table); -boolean result = archiver.archiveIfRequired(context); -assertTrue(result); +int result = archiver.archiveIfRequired(context); Review Comment: We can add some tests for archive metrics like what we done for compaction metrics in #8759. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
hudi-bot commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815670631 ## CI report: * 178ef4eadac6ab6d009d86ab86d35babe952 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942) * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
hudi-bot commented on PR #10125: URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815666263 ## CI report: * d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN * f2f380dec7f0afa5fd7fb0accbe8c17e22853f00 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20963) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes [hudi]
zyclove commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-1815661842 I also encountered the same problem with 0.14.0, how to solve it? disable metadata ? set hoodie.metadata.table=false; change hoodie.parquet.small.file.limit ? set hoodie.bloom.index.prune.by.ranges = false ? change hoodie.memory.merge.max.size ? Can this be optimized in hudi 1.0? This stage is simply too time consuming. ![image](https://github.com/apache/hudi/assets/15028279/090b9a46-94fc-4d8c-9ea8-041437632761) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
majian1998 commented on PR #10101: URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815653571 Now, the archive metrics have been moved to `BaseHoodieTableServiceClient`, and the return value of archiveIfRequired has been modified. In the previous implementation, the `success` variable always seemed to be true? So, I have also removed the relevant checks in the unit tests. cc@danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Disproportionately Slow performance during "Building workload profile" phase [hudi]
zyclove commented on issue #8189: URL: https://github.com/apache/hudi/issues/8189#issuecomment-1815649576 ![image](https://github.com/apache/hudi/assets/15028279/49829897-5dbc-4a98-bde9-2ded448681a2) ![image](https://github.com/apache/hudi/assets/15028279/8753d412-d06b-475a-a106-823e785fd96b) What do these two stages do? It keeps getting stuck, which is very time-consuming. Can you give me some optimization suggestions? Thank you. hudi 0.14.0 spark 3.2.1 @nsivabalan @codope -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]
danny0405 commented on code in PR #10120: URL: https://github.com/apache/hudi/pull/10120#discussion_r1396587932 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala: ## @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hudi.command.procedures + +import org.apache.avro.generic.IndexedRecord +import org.apache.hudi.avro.model.{BooleanWrapper, BytesWrapper, DateWrapper, DecimalWrapper, DoubleWrapper, FloatWrapper, HoodieMetadataColumnStats, IntWrapper, LongWrapper, StringWrapper, TimeMicrosWrapper, TimestampMicrosWrapper} +import org.apache.hudi.common.config.HoodieMetadataConfig +import org.apache.hudi.common.data.HoodieData +import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} +import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport} +import org.apache.spark.internal.Logging +import org.apache.spark.sql.Row +import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType} + +import java.util +import java.util.function.Supplier + + +class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with ProcedureBuilder with Logging { + private val PARAMETERS = Array[ProcedureParameter]( +ProcedureParameter.required(0, "table", DataTypes.StringType), +ProcedureParameter.optional(1, "targetColumns", DataTypes.StringType) + ) + + private val OUTPUT_TYPE = new StructType(Array[StructField]( +StructField("file_name", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("column_name", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("min_value", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("max_value", DataTypes.StringType, nullable = true, Metadata.empty), +StructField("null_num", DataTypes.LongType, nullable = true, Metadata.empty) + )) + + def parameters: Array[ProcedureParameter] = PARAMETERS + + def outputType: StructType = OUTPUT_TYPE + + override def call(args: ProcedureArgs): Seq[Row] = { +super.checkArgs(PARAMETERS, args) + +val table = getArgValueOrDefault(args, PARAMETERS(0)) +val targetColumns = getArgValueOrDefault(args, PARAMETERS(1)).getOrElse("").toString +val targetColumnsSeq = targetColumns.split(",").toSeq +val 
basePath = getBasePath(table) +val metadataConfig = HoodieMetadataConfig.newBuilder + .enable(true) + .build +val metaClient = HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build +val schemaUtil = new TableSchemaResolver(metaClient) +var schema = AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema) +val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, metadataConfig, metaClient) +val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, false) + +val rows = new util.ArrayList[Row] +colStatsRecords.collectAsList() + .stream() + .forEach(c => { +rows.add(Row(c.getFileName, c.getColumnName, getColumnStatsValue(c.getMinValue), getColumnStatsValue(c.getMaxValue), c.getNullCount.longValue())) + }) +rows.stream().toArray().map(r => r.asInstanceOf[Row]).toList + } + + def getColumnStatsValue(stats_value: Any): String = { +stats_value match { + case _: IntWrapper | + _: BooleanWrapper | + _: BytesWrapper | Review Comment: `java.nio.HeapByteBuffer[pos=0 lim=11 cap=11]` looks not very clear -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815627236

## CI report:

* 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)
* d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN
* f2f380dec7f0afa5fd7fb0accbe8c17e22853f00 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815621023

## CI report:

* 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)
* d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (HUDI-7109) Fix Flink may re-use a committed instant in append mode
[ https://issues.apache.org/jira/browse/HUDI-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7109:
-----------------------------
    Fix Version/s: 0.14.1
                   1.0.0

> Fix Flink may re-use a committed instant in append mode
> -------------------------------------------------------
>
>                 Key: HUDI-7109
>                 URL: https://issues.apache.org/jira/browse/HUDI-7109
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1, 1.0.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Closed] (HUDI-7109) Fix Flink may re-use a committed instant in append mode
[ https://issues.apache.org/jira/browse/HUDI-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7109.
----------------------------
    Resolution: Fixed

Fixed via master branch: 3d0c4501d6b7062f38b7755f71660818cd95c1f6

> Fix Flink may re-use a committed instant in append mode
> -------------------------------------------------------
>
>                 Key: HUDI-7109
>                 URL: https://issues.apache.org/jira/browse/HUDI-7109
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1, 1.0.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
(hudi) branch master updated (f06ff5b3e0e -> 3d0c4501d6b)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from f06ff5b3e0e  [HUDI-7090] Set the maxParallelism for singleton operator (#10090)
     add 3d0c4501d6b  [HUDI-7109] Fix Flink may re-use a committed instant in append mode (#10119)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Re: [PR] [HUDI-7109] Fix Flink may re-use a committed instant in append mode [hudi]
danny0405 merged PR #10119: URL: https://github.com/apache/hudi/pull/10119 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-7090) Set maxParallelism for singleton operator ,for example compact_plan_generate、split_monitor、compact_commit
[ https://issues.apache.org/jira/browse/HUDI-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7090.
----------------------------
    Resolution: Fixed

Fixed via master branch: f06ff5b3e0ee8bb6e49aad04d3b6054d6c46e272

> Set maxParallelism for singleton operator, for example
> compact_plan_generate、split_monitor、compact_commit
> ------------------------------------------------------
>
>                 Key: HUDI-7090
>                 URL: https://issues.apache.org/jira/browse/HUDI-7090
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: hehuiyuan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1, 1.0.0
>
> The parallelism of a singleton operator should be strictly controlled, which
> can avoid some problems caused by modifying parallelism through other means.
> For example:
> if the split_monitor operator parallelism is set to 10, it will read the data 10
> times.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7090) Set maxParallelism for singleton operator ,for example compact_plan_generate、split_monitor、compact_commit
[ https://issues.apache.org/jira/browse/HUDI-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7090:
-----------------------------
    Fix Version/s: 0.14.1
                   1.0.0

> Set maxParallelism for singleton operator, for example
> compact_plan_generate、split_monitor、compact_commit
> ------------------------------------------------------
>
>                 Key: HUDI-7090
>                 URL: https://issues.apache.org/jira/browse/HUDI-7090
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: hehuiyuan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.1, 1.0.0
>
> The parallelism of a singleton operator should be strictly controlled, which
> can avoid some problems caused by modifying parallelism through other means.
> For example:
> if the split_monitor operator parallelism is set to 10, it will read the data 10
> times.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
(hudi) branch master updated: [HUDI-7090] Set the maxParallelism for singleton operator (#10090)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new f06ff5b3e0e  [HUDI-7090] Set the maxParallelism for singleton operator (#10090)
f06ff5b3e0e is described below

commit f06ff5b3e0ee8bb6e49aad04d3b6054d6c46e272
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Fri Nov 17 09:43:21 2023 +0800

    [HUDI-7090] Set the maxParallelism for singleton operator (#10090)
---
 .../hudi/sink/clustering/HoodieFlinkClusteringJob.java     |  4 +++-
 .../org/apache/hudi/sink/compact/HoodieFlinkCompactor.java |  4 +++-
 .../main/java/org/apache/hudi/sink/utils/Pipelines.java    | 14 +++++++++---
 .../main/java/org/apache/hudi/table/HoodieTableSource.java |  1 +
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
index a453cac6803..0966f6995bd 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
@@ -348,7 +348,9 @@ public class HoodieFlinkClusteringJob {
           .addSink(new ClusteringCommitSink(conf))
           .name("clustering_commit")
           .uid("uid_clustering_commit")
-          .setParallelism(1);
+          .setParallelism(1)
+          .getTransformation()
+          .setMaxParallelism(1);

       env.execute("flink_hudi_clustering_" + clusteringInstant.getTimestamp());
     }

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
index 57e823ab21c..99dd45d94b4 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
@@ -298,7 +298,9 @@ public class HoodieFlinkCompactor {
           .addSink(new CompactionCommitSink(conf))
           .name("compaction_commit")
           .uid("uid_compaction_commit")
-          .setParallelism(1);
+          .setParallelism(1)
+          .getTransformation()
+          .setMaxParallelism(1);

       env.execute("flink_hudi_compaction_" + String.join(",", compactionInstantTimes));
     }

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
index e66009aa551..b3acd4cfa11 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
@@ -410,10 +410,11 @@ public class Pipelines {
    * @return the compaction pipeline
    */
   public static DataStreamSink compact(Configuration conf, DataStream dataStream) {
-    return dataStream.transform("compact_plan_generate",
+    DataStreamSink compactionCommitEventDataStream = dataStream.transform("compact_plan_generate",
         TypeInformation.of(CompactionPlanEvent.class),
         new CompactionPlanOperator(conf))
         .setParallelism(1) // plan generate must be singleton
+        .setMaxParallelism(1)
         // make the distribution strategy deterministic to avoid concurrent modifications
         // on the same bucket files
         .keyBy(plan -> plan.getOperation().getFileGroupId().getFileId())
@@ -424,6 +425,8 @@ public class Pipelines {
         .addSink(new CompactionCommitSink(conf))
         .name("compact_commit")
         .setParallelism(1); // compaction commit should be singleton
+    compactionCommitEventDataStream.getTransformation().setMaxParallelism(1);
+    return compactionCommitEventDataStream;
   }

   /**
@@ -452,6 +455,7 @@ public class Pipelines {
         TypeInformation.of(ClusteringPlanEvent.class),
         new ClusteringPlanOperator(conf))
         .setParallelism(1) // plan generate must be singleton
+        .setMaxParallelism(1) // plan generate must be singleton
         .keyBy(plan ->
             // make the distribution strategy deterministic to avoid concurrent modifications
             // on the same bucket files
@@ -465,15 +469,19 @@ public class Pipelines {
       ExecNodeUtil.setManagedMemoryWeight(clusteringStream.getTransformation(),
           conf.getInteger(FlinkOptions.WRITE_SORT_MEMORY) * 1024L * 1024L);
     }
-    return clusteringStream.addSink(new ClusteringCommitSink(conf))
+    DataStreamSink clusteringCommitEventDataStream =
Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]
danny0405 merged PR #10090: URL: https://github.com/apache/hudi/pull/10090 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]When hudi integrates hive, an error is reported when the hive external table is queried [hudi]
danny0405 commented on issue #10084:
URL: https://github.com/apache/hudi/issues/10084#issuecomment-1815611713

Flink 1.13.1 should use Parquet 1.11, right? Have you checked the project parquet version for other modules so you do not package multiple parquet jars in one shot?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]
danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396571766

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java: ##

@@ -117,6 +122,10 @@ public boolean archiveIfRequired(HoodieEngineContext context, boolean acquireLoc
     } else {
       LOG.info("No Instants to archive");
     }
+    if (success && timerContext != null) {
+      long durationMs = metrics.getDurationInMs(timerContext.stop());

Review Comment:
   Yeah, need to think through what the return type should look like.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
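The timer-context pattern under review (start a timer before the archive work, then report `getDurationInMs` on stop) can be sketched generically; this is an illustration of the metric-timing idea, not the Hudi metrics API:

```python
import time

class TimerContext:
    # Generic stand-in for a metrics timer context: capture a start
    # point on creation, report elapsed milliseconds on stop().
    def __init__(self):
        self._start = time.monotonic()

    def stop(self):
        return (time.monotonic() - self._start) * 1000.0

timer = TimerContext()
time.sleep(0.01)           # stands in for the archive work
duration_ms = timer.stop()
print(f"archive took {duration_ms:.1f} ms")
```

Using a monotonic clock matters here: wall-clock time can jump (NTP adjustments), which would skew a duration metric.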
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396557170

## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java: ##

@@ -79,7 +78,7 @@ public BigQuerySyncTool(Properties props) {
     this.bqSchemaResolver = BigQuerySchemaResolver.getInstance();
   }

-  @VisibleForTesting // allows us to pass in mocks for the writer and client
+  // allows us to pass in mocks for the writer and client

Review Comment:
   yes, accidentally removed

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815576193

## CI report:

* 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r139668

## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java: ##

@@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .markAdvanced()
       .withDocumentation("Fetch file listing from Hudi's metadata");

+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.require_partition_filter")
+      .defaultValue(false)
+      .withDocumentation("If true, configure table to require a partition filter to be specified when querying the table");
+
+  public static final ConfigProperty<String> BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty
+      .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id")
+      .noDefaultValue()
+      .withDocumentation("The Big Lake connection ID to use");
+

Review Comment:
   should it be 0.14.1?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396555489

## hudi-gcp/pom.xml: ##

@@ -70,7 +70,6 @@ See https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
     <dependency>
       <groupId>com.google.cloud</groupId>
       <artifactId>google-cloud-pubsub</artifactId>
-      <version>${google.cloud.pubsub.version}</version>
     </dependency>

Review Comment:
   We include the bom above so this was not required

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
yihua commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396550878

## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java: ##

@@ -83,7 +83,6 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .key("hoodie.gcp.bigquery.sync.use_bq_manifest_file")
       .defaultValue(false)
       .markAdvanced()
-      .sinceVersion("0.14.0")

Review Comment:
   nit: this is still needed.

## hudi-gcp/pom.xml: ##

@@ -70,7 +70,6 @@ See https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
     <dependency>
       <groupId>com.google.cloud</groupId>
       <artifactId>google-cloud-pubsub</artifactId>
-      <version>${google.cloud.pubsub.version}</version>
    </dependency>

Review Comment:
   Is this safe to remove now?

## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java: ##

@@ -79,7 +78,7 @@ public BigQuerySyncTool(Properties props) {
     this.bqSchemaResolver = BigQuerySchemaResolver.getInstance();
   }

-  @VisibleForTesting // allows us to pass in mocks for the writer and client
+  // allows us to pass in mocks for the writer and client

Review Comment:
   Is the annotation still needed?

## hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java: ##

@@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig implements Serializable
       .markAdvanced()
       .withDocumentation("Fetch file listing from Hudi's metadata");

+  public static final ConfigProperty<Boolean> BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
+      .key("hoodie.gcp.bigquery.sync.require_partition_filter")
+      .defaultValue(false)
+      .withDocumentation("If true, configure table to require a partition filter to be specified when querying the table");
+
+  public static final ConfigProperty<String> BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty
+      .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id")
+      .noDefaultValue()
+      .withDocumentation("The Big Lake connection ID to use");
+

Review Comment:
   Add `.sinceVersion("0.14.0")` and mark them advanced (`.markAdvanced()`, if they are not required to be set by the users)?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815570231

## CI report:

* 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (HUDI-7115) Add more options for BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7115:
---------------------------------
    Labels: pull-request-available  (was: )

> Add more options for BigQuery Sync
> ----------------------------------
>
>                 Key: HUDI-7115
>                 URL: https://issues.apache.org/jira/browse/HUDI-7115
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Timothy Brown
>            Priority: Major
>              Labels: pull-request-available
>
> There are options for requiring a partition filter and adding a big lake
> connection ID to leverage some new access control features that users may
> want to leverage in their environment.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]
the-other-tim-brown opened a new pull request, #10125:
URL: https://github.com/apache/hudi/pull/10125

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Created] (HUDI-7115) Add more options for BigQuery Sync
Timothy Brown created HUDI-7115:
-----------------------------------
             Summary: Add more options for BigQuery Sync
                 Key: HUDI-7115
                 URL: https://issues.apache.org/jira/browse/HUDI-7115
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Timothy Brown

There are options for requiring a partition filter and adding a big lake connection ID to leverage some new access control features that users may want to leverage in their environment.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
Re: [I] [SUPPORT] RFC 63 Functional Index Hudi 0.1.0-beta [hudi]
soumilshah1995 commented on issue #10110:
URL: https://github.com/apache/hudi/issues/10110#issuecomment-1815424659

# Code

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, FloatType
from datetime import datetime
import os
import sys

HUDI_VERSION = '1.0.0-beta1'
SPARK_VERSION = '3.4'

SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable

# Spark session
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()

data = [
    [1695159649, '334e26e9-8355-45cc-97c6-c31daf0df330', 'rider-A', 'driver-K', 19.10, 'san_francisco'],
    [1695159649, 'e96c4396-3fad-413a-a942-4cb36106d721', 'rider-C', 'driver-M', 27.70, 'san_francisco'],
]

# Define schema for the DataFrame
schema = StructType([
    StructField("ts", StringType(), True),
    StructField("transaction_id", StringType(), True),
    StructField("rider", StringType(), True),
    StructField("driver", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("location", StringType(), True),
])

# Create Spark DataFrame
df = spark.createDataFrame(data, schema=schema)
df.show()

path = 'file:///Users/soumilnitinshah/Downloads/hudidb/hudi_table_func_index'

hudi_options = {
    'hoodie.table.name': 'hudi_table_func_index',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'transaction_id',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.table.metadata.enable': 'true',
    'hoodie.datasource.write.partitionpath.field': 'location',
    'hoodie.parquet.small.file.limit': '0'
}

df.write.format("hudi").options(**hudi_options).mode("append").save(path)

PATH = 'file:///Users/soumilnitinshah/Downloads/hudidb/hudi_table_func_index'
TABLE_NAME = "hudi_table_func_index"
spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)

spark.sql(f"""SELECT from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM {TABLE_NAME}""").show()

spark.sql(f"""CREATE INDEX {TABLE_NAME}_datestr ON {TABLE_NAME} USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')""")
```

# Error

```
+----------+
|   datestr|
+----------+
|2023-09-19|
|2023-09-19|
+----------+

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[7], line 7
      4 spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)
      6 spark.sql(f"""SELECT from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM {TABLE_NAME}""").show()
----> 7 spark.sql(f"""CREATE INDEX {TABLE_NAME}_datestr ON {TABLE_NAME} USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')""")

File ~/anaconda3/lib/python3.11/site-packages/pyspark/sql/session.py:1440, in SparkSession.sql(self, sqlQuery, args, **kwargs)
   1438 try:
   1439     litArgs = {k: _to_java_column(lit(v)) for k, v in (args or {}).items()}
-> 1440     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
   1441 finally:
   1442     if len(kwargs) > 0:

File ~/anaconda3/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File ~/anaconda3/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File
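For reference, the `from_unixtime` expression in the reproduction above simply formats an epoch-seconds value as a `yyyy-MM-dd` date string, which is why the SELECT returns `2023-09-19` for the sample rows. A plain-Python equivalent, useful for sanity-checking the expected index values (assuming the timestamp is interpreted in UTC; Spark actually uses `spark.sql.session.timeZone`):

```python
from datetime import datetime, timezone

def from_unixtime(ts, fmt="%Y-%m-%d"):
    # Python analogue of Spark SQL's from_unixtime(ts, 'yyyy-MM-dd'),
    # interpreting the epoch-seconds value in UTC (an assumption; the
    # actual result can shift by a day near midnight in other time zones)
    return datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime(fmt)

print(from_unixtime(1695159649))  # sample ts from the reproduction -> 2023-09-19
```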
Re: [PR] [MINOR] Fix default config values if not specified in MultipleSparkJobExecutionStrategy [hudi]
nsivabalan commented on code in PR #9625:
URL: https://github.com/apache/hudi/pull/9625#discussion_r1396445022

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java: ##

@@ -116,7 +116,7 @@ public HoodieWriteMetadata> performClustering(final Hood
     Stream> writeStatusesStream = FutureUtils.allOf(
         clusteringPlan.getInputGroups().stream()
             .map(inputGroup -> {
-              if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {

Review Comment:
   @yihua: this was intentionally kept as `false` (the default value). So only if the user explicitly enabled the row writer will we enable the row writer with clustering.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
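The opt-in behavior described above comes down to a boolean lookup with a `false` default: an absent key disables the row-writer clustering path. A minimal sketch of that semantics (a plain dict stands in for the real write config, so the helper name is an assumption):

```python
def row_writer_enabled(write_config):
    # Mirrors getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false):
    # when the key is absent the default applies, so clustering falls back
    # to the non-row-writer code path unless the user explicitly opts in.
    value = write_config.get("hoodie.datasource.write.row.writer.enable", "false")
    return str(value).lower() == "true"

print(row_writer_enabled({}))  # default: disabled
print(row_writer_enabled({"hoodie.datasource.write.row.writer.enable": "true"}))  # explicit opt-in
```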
Re: [I] Unable to alter column name for a Hudi table in AWS [hudi]
soumilshah1995 commented on issue #9780:
URL: https://github.com/apache/hudi/issues/9780#issuecomment-1815417347

Found it. Here is the code: https://github.com/soumilshah1995/code-snippets/blob/main/schema_evol_lab.ipynb

And here is a video guide: https://www.youtube.com/watch?v=-_TFB0Gh3TY

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [I] Unable to alter column name for a Hudi table in AWS [hudi]
soumilshah1995 commented on issue #9780:
URL: https://github.com/apache/hudi/issues/9780#issuecomment-1815415276

Let me search for my code snippet; I know I have done this on AWS.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-6958] Simplify Out Of Box Schema Evolution Functionality - DOCS [hudi]
lokesh-lingarajan-0310 commented on code in PR #9881:
URL: https://github.com/apache/hudi/pull/9881#discussion_r1396339333

## website/docs/schema_evolution.md: ##

@@ -22,21 +22,36 @@ the previous schema (e.g., renaming a column). Furthermore, the evolved schema i
 queryable across high-performance engines like Presto and Spark SQL without additional overhead for column ID translations or type reconciliations.
 The following table summarizes the schema changes compatible with different Hudi table types.

-| Schema Change | COW | MOR | Remarks |
-|:--------------|:----|:----|:--------|
-| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
-| Add a new nullable column to inner struct (at the end) | Yes | Yes |
-| Add a new complex type field with default (map and array) | Yes | Yes | |
-| Add a new nullable column and change the ordering of fields | No | No | Write succeeds but read fails if the write with evolved schema updated only some of the base files but not all. Currently, Hudi does not maintain a schema registry with history of changes across base files. Nevertheless, if the upsert touched all base files then the read will succeed. |
-| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
-| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec#Schema+Resolution). |
-| Promote datatype from `int` to `long` for a nested field | Yes | Yes |
-| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
-| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. As a **workaround**, you can make the field nullable. |
-| Add a new non-nullable column to inner struct (at the end) | No | No | |
-| Change datatype from `long` to `int` for a nested field | No | No |