Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #10080:
URL: https://github.com/apache/hudi/pull/10080#discussion_r1396826530


##
website/docs/indexing.md:
##
@@ -155,4 +155,11 @@ to finally check the incoming updates against all files. 
The `SIMPLE` Index will
 `HBASE` index can be employed, if the operational overhead is acceptable and 
would provide much better lookup times for these tables.
 
 When using a global index, users should also consider setting 
`hoodie.bloom.index.update.partition.path=true` or 
`hoodie.simple.index.update.partition.path=true` to deal with cases where the
-partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
\ No newline at end of file
+partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
+
+
+## Related Resources
+Videos
+
+* [Global Bloom Index: Remove duplicates & guarantee uniqueness - Hudi 
Labs](https://youtu.be/XlRvMFJ7g9c)
+* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands 
on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY)

Review Comment:
   The first one is okay. Let's move the second one over there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #10080:
URL: https://github.com/apache/hudi/pull/10080#discussion_r1396826296


##
website/docs/indexing.md:
##
@@ -155,4 +155,11 @@ to finally check the incoming updates against all files. 
The `SIMPLE` Index will
 `HBASE` index can be employed, if the operational overhead is acceptable and 
would provide much better lookup times for these tables.
 
 When using a global index, users should also consider setting 
`hoodie.bloom.index.update.partition.path=true` or 
`hoodie.simple.index.update.partition.path=true` to deal with cases where the
-partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
\ No newline at end of file
+partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
+
+
+## Related Resources
+Videos
+
+* [Global Bloom Index: Remove duplicates & guarantee uniqueness - Hudi 
Labs](https://youtu.be/XlRvMFJ7g9c)
+* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands 
on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY)

Review Comment:
   This one is a good fit for the page - 
https://hudi.apache.org/docs/next/metadata_indexing 



##
website/docs/indexing.md:
##
@@ -155,4 +155,11 @@ to finally check the incoming updates against all files. 
The `SIMPLE` Index will
 `HBASE` index can be employed, if the operational overhead is acceptable and 
would provide much better lookup times for these tables.
 
 When using a global index, users should also consider setting 
`hoodie.bloom.index.update.partition.path=true` or 
`hoodie.simple.index.update.partition.path=true` to deal with cases where the
-partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
\ No newline at end of file
+partition path value could change due to an update e.g users table partitioned 
by home city; user relocates to a different city. These tables are also 
excellent candidates for the Merge-On-Read table type.
+
+
+## Related Resources
+Videos
+
+* [Global Bloom Index: Remove duplicates & guarantee uniqueness - Hudi 
Labs](https://youtu.be/XlRvMFJ7g9c)
+* [Advantages of Metadata Indexing and Asynchronous Indexing in Hudi - Hands 
on Lab](https://www.youtube.com/watch?v=TSphQCsY4pY)

Review Comment:
   Let's move it over there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


beyond1920 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396823348


##
.github/workflows/bot.yml:
##
@@ -377,15 +370,6 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark3.1'
-sparkRuntime: 'spark3.1.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


beyond1920 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396822953


##
.github/workflows/bot.yml:
##
@@ -302,15 +301,9 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   Done



##
.github/workflows/bot.yml:
##
@@ -377,15 +370,6 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   Done



##
.github/workflows/bot.yml:
##
@@ -302,15 +301,9 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark3.1'
-sparkRuntime: 'spark3.1.3'
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.0'
 sparkRuntime: 'spark3.0.2'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark2.4'

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #10080:
URL: https://github.com/apache/hudi/pull/10080#discussion_r1396823127


##
website/docs/key_generation.md:
##
@@ -209,3 +209,8 @@ Partition path generated from key generator: "2020040118"
 
 Input field value: "20200401" 
 Partition path generated from key generator: "04/01/2020"
+
+## Related Resources

Review Comment:
   This video might be relevant for de-duping, but it does not address key generation. Let's remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815889971

   
   ## CI report:
   
   * 47200120dd37ebaee77c583628bfddac1f564b3b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20967)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT]hudi insert is too slow & [hudi]

2023-11-16 Thread via GitHub


zyclove opened a new issue, #10131:
URL: https://github.com/apache/hudi/issues/10131

   
   
   **Describe the problem you faced**
   
   Spark SQL bulk insert is too slow; how can I tune performance?
   As suggested in https://hudi.apache.org/docs/performance, 
   I changed many configs, but it did not help.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Set Hudi configs in spark-sql:
   set 
hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
   set spark.sql.hive.filesourcePartitionFileCacheSize=524288000;
   set hoodie.metadata.table=false;
   set hoodie.sql.insert.mode=non-strict;
   set hoodie.sql.bulk.insert.enable=true;
   set hoodie.populate.meta.fields=false;
   set hoodie.parquet.compression.codec=snappy;
   
   set hoodie.bloom.index.prune.by.ranges=false;
   set hoodie.file.listing.parallelism=800;
   set hoodie.cleaner.parallelism=800;
   set hoodie.insert.shuffle.parallelism=800;
   set hoodie.upsert.shuffle.parallelism=800;
   set hoodie.delete.shuffle.parallelism=800;
   set hoodie.memory.compaction.max.size= 4294967296;
   set hoodie.memory.merge.max.size=107374182400;
   2. Run the SQL:
   ```
   insert into bi_dw_real.dwd_smart_datapoint_report_rw_clear_rt 
   select
 /*+ coalesce(${partitions}) */
 
md5(concat(coalesce(data_id,''),coalesce(dev_id,''),coalesce(gw_id,''),coalesce(product_id,''),coalesce(uid,''),coalesce(dp_code,''),coalesce(dp_id,''),if(dp_mode
 in 
('ro','rw','wr'),dp_mode,'un'),coalesce(dp_name,''),coalesce(dp_time,''),coalesce(dp_type,''),coalesce(dp_value,''),coalesce(ct,'')))
 as id, 
 _hoodie_record_key as uuid,
 data_id,dev_id,gw_id,product_id,uid,
 dp_code,dp_id,if(dp_mode in ('ro','rw','wr'),dp_mode,'un') as dp_mode 
,dp_name,dp_time,dp_type,dp_value,
 ct as gmt_modified,
 case 
 when length(ct)=10 then 
date_format(from_unixtime(ct),'MMddHH')  
 when length(ct)=13 then 
date_format(from_unixtime(ct/1000),'MMddHH') 
 else '1970010100' end as dt
   from 
   
hudi_table_changes('bi_ods_real.ods_log_smart_datapoint_report_batch_rt', 
'latest_state', '${taskBeginTime}', '${next30minuteTime}')   
   lateral  view dataPointExplode(split(value,'\001')[0]) dps as ct, 
data_id, dev_id, gw_id, product_id, uid, dp_code, dp_id, gmtModified, dp_mode, 
dp_name, dp_time, dp_type, dp_value
   where _hoodie_commit_time >${taskBeginTime} and 
_hoodie_commit_time<=${next30minuteTime};
   ``` 
   3. Result table info:
   tblproperties (
 type = 'mor',
 primaryKey = 'id',
 preCombineField = 'gmt_modified',
 hoodie.combine.before.upsert='false',
 hoodie.bucket.index.num.buckets=128,
 hoodie.compact.inline='false',
 hoodie.common.spillable.diskmap.type='ROCKS_DB',
 hoodie.datasource.write.partitionpath.field='dt,dp_mode',
 
hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload'
)
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :0.14.0
   
   * Spark version :3.2.1
   
   * Hive version :3.2.1
   
   * Hadoop version :3.2.2
   
   * Storage (HDFS/S3/GCS..) :s3
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   
![image](https://github.com/apache/hudi/assets/15028279/96c73e5a-d1f2-4db0-b583-37605ca754d0)
   
   
![image](https://github.com/apache/hudi/assets/15028279/1ce817b5-b372-4c09-a14c-f5ebf83f32fb)
   
   How do I change the parallelism? I set `spark-sql --conf spark.default.parallelism=800`, but it does not work.
   
   The following configs in the SQL file do not work as expected.
   
   ```
   set hoodie.file.listing.parallelism=800;
   set hoodie.cleaner.parallelism=800;
   set hoodie.insert.shuffle.parallelism=800;
   set hoodie.upsert.shuffle.parallelism=800;
   set hoodie.delete.shuffle.parallelism=800;
   ``` 
   
   
   The following issues are not about bulk insert:
   #8189  #2620 
   Please take a look and give me some optimization suggestions.
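   
   For reference, a rough sketch of the knobs that usually matter for Spark SQL bulk insert in 0.14.0 (config names should be double-checked against the config reference; values are placeholders, not a verified fix for this issue):
   
   ```sql
   -- Hypothetical tuning sketch for a spark-sql session doing bulk insert.
   set hoodie.spark.sql.insert.into.operation=bulk_insert;  -- 0.14.0 way to choose bulk_insert for INSERT INTO
   set hoodie.bulkinsert.shuffle.parallelism=800;           -- parallelism used by the bulk_insert write path
   set hoodie.bulkinsert.sort.mode=NONE;                    -- NONE / PARTITION_SORT / GLOBAL_SORT; NONE skips the sort stage
   set hoodie.datasource.write.row.writer.enable=true;      -- row-writer path avoids Avro conversion overhead
   ```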
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #10080:
URL: https://github.com/apache/hudi/pull/10080#discussion_r1396818699


##
website/docs/concurrency_control.md:
##
@@ -279,4 +279,10 @@ hoodie.cleaner.policy.failed.writes=EAGER
 ## Caveats
 
 If you are using the `WriteClient` API, please note that multiple writes to 
the table need to be initiated from 2 different instances of the write client. 
-It is **NOT** recommended to use the same instance of the write client to 
perform multi writing. 
\ No newline at end of file
+It is **NOT** recommended to use the same instance of the write client to 
perform multi writing. 
+
+## Related Resources
+Videos
+
+* [Efficient Data Ingestion with Glue Concurrency and Hudi Data 
Lake](https://www.youtube.com/watch?v=G_Xd1haycqE)

Review Comment:
   This might be just Glue concurrency, not Hudi. We can remove the first one.
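   
   For context, a minimal sketch of the multi-writer (OCC) setup this page's caveats refer to, assuming a Zookeeper-based lock provider (values are placeholders):
   
   ```sql
   -- Illustrative optimistic concurrency control settings for multiple writers.
   set hoodie.write.concurrency.mode=optimistic_concurrency_control;
   set hoodie.cleaner.policy.failed.writes=LAZY;
   set hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider;
   set hoodie.write.lock.zookeeper.url=zk-host;             -- placeholder host
   set hoodie.write.lock.zookeeper.port=2181;
   set hoodie.write.lock.zookeeper.lock_key=my_table;       -- placeholder lock key
   set hoodie.write.lock.zookeeper.base_path=/hudi_locks;   -- placeholder base path
   ```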



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Added video resources to Concepts and Services Sections [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #10080:
URL: https://github.com/apache/hudi/pull/10080#discussion_r1396818335


##
website/docs/compaction.md:
##
@@ -231,3 +231,5 @@ Offline compaction needs to submit the Flink task on the 
command line. The progr
 | `--seq` | `LIFO`  (Optional)   | The order in which compaction tasks are 
executed. Executing from the latest compaction plan by default. `LIFO`: 
executing from the latest plan. `FIFO`: executing from the oldest plan. |
 | `--service` | `false`  (Optional)  | Whether to start a monitoring service 
that checks and schedules new compaction task in configured interval. |
 | `--min-compaction-interval-seconds` | `600(s)` (optional)  | The checking 
interval for service mode, by default 10 minutes. |
+

Review Comment:
   Avoid whitespace changes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on PR #9952:
URL: https://github.com/apache/hudi/pull/9952#issuecomment-1815859946

   Can we also add an example for Incremental queries under Flink in 
sql_queries page for completeness?
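   
   For instance, a sketch along these lines could work (option names from the Flink Hudi connector; the commit instants are placeholders):
   
   ```sql
   -- Illustrative incremental query on a Hudi table with Flink SQL, using dynamic table options.
   SELECT * FROM hudi_table
   /*+ OPTIONS(
     'read.start-commit' = '20231115000000',  -- placeholder start commit instant
     'read.end-commit'   = '20231116000000'   -- placeholder end commit instant
   ) */;
   ```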


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396792988


##
website/docs/sql_dml.md:
##
@@ -199,7 +199,52 @@ You can control the behavior of these operations using 
various configuration opt
 
 ## Flink 
 
-Flink SQL also provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. All these operations are already 
-showcased in the [Flink Quickstart](/docs/flink-quick-start-guide).
+Flink SQL provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. These operations allow you to insert, update and 
delete data from your Hudi tables. Let's explore them one by one.
 
+### Insert Data
 
+You can utilize the INSERT INTO statement to incorporate data into a Hudi 
table using Flink SQL. Here are a few illustrative examples:
+
+```sql
+INSERT INTO <table> 
+SELECT <columns> FROM <source>;
+```
+
+Examples:
+
+```sql
+-- Insert into a Hudi table
+INSERT INTO hudi_table SELECT 1, 'a1', 20;
+```
+
+If the `write.operation` is 'upsert,' the INSERT INTO statement will not only 
insert new records but also update existing rows with the same record key.
+
+### Update Data
+With Flink SQL, you can use update command to update the hudi table. Here are 
a few illustrative examples:
+
+```sql
+UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE 
boolExpression]
+```
+
+```sql
+UPDATE hudi_table SET price = price * 2, ts =  WHERE id = 1;
+```
+
+:::note Key requirements
+Update queries only work with batch execution mode.
+:::
+
+### Delete Data

Review Comment:
   Change `Delete Data` -> `Delete From` under both Spark SQL and Flink 
sections to be consistent with naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2023-11-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7117:
-

Assignee: Sagar Sumit

> Functional index creation not working when table is created using datasource 
> writer
> ---
>
> Key: HUDI-7117
> URL: https://issues.apache.org/jira/browse/HUDI-7117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Aditya Goenka
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Details and Reproducible code under Github Issue - 
> [https://github.com/apache/hudi/issues/10110]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-16 Thread via GitHub


majian1998 commented on code in PR #10120:
URL: https://github.com/apache/hudi/pull/10120#discussion_r1396791841


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hudi.avro.model.{BooleanWrapper, BytesWrapper, DateWrapper, 
DecimalWrapper, DoubleWrapper, FloatWrapper, HoodieMetadataColumnStats, 
IntWrapper, LongWrapper, StringWrapper, TimeMicrosWrapper, 
TimestampMicrosWrapper}
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.Supplier
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField("file_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("column_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("min_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("max_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("null_num", DataTypes.LongType, nullable = true, 
Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+super.checkArgs(PARAMETERS, args)
+
+val table = getArgValueOrDefault(args, PARAMETERS(0))
+val targetColumns = getArgValueOrDefault(args, 
PARAMETERS(1)).getOrElse("").toString
+val targetColumnsSeq = targetColumns.split(",").toSeq
+val basePath = getBasePath(table)
+val metadataConfig = HoodieMetadataConfig.newBuilder
+  .enable(true)
+  .build
+val metaClient = 
HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+val schemaUtil = new TableSchemaResolver(metaClient)
+var schema = 
AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, 
metadataConfig, metaClient)
+val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = 
columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, false)
+
+val rows = new util.ArrayList[Row]
+colStatsRecords.collectAsList()
+  .stream()
+  .forEach(c => {
+rows.add(Row(c.getFileName, c.getColumnName, 
getColumnStatsValue(c.getMinValue), getColumnStatsValue(c.getMaxValue), 
c.getNullCount.longValue()))
+  })
+rows.stream().toArray().map(r => r.asInstanceOf[Row]).toList
+  }
+
+  def getColumnStatsValue(stats_value: Any): String = {
+stats_value match {
+  case _: IntWrapper |
+   _: BooleanWrapper |
+   _: BytesWrapper |

Review Comment:
   Support for this case has already been implemented; these two types will now display the string representation of the array.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396791058


##
website/docs/sql_dml.md:
##
@@ -199,7 +199,52 @@ You can control the behavior of these operations using 
various configuration opt
 
 ## Flink 
 
-Flink SQL also provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. All these operations are already 
-showcased in the [Flink Quickstart](/docs/flink-quick-start-guide).
+Flink SQL provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. These operations allow you to insert, update and 
delete data from your Hudi tables. Let's explore them one by one.
 
+### Insert Data
 
+You can utilize the INSERT INTO statement to incorporate data into a Hudi 
table using Flink SQL. Here are a few illustrative examples:
+
+```sql
+INSERT INTO <table> 
+SELECT <columns> FROM <source>;
+```
+
+Examples:
+
+```sql
+-- Insert into a Hudi table
+INSERT INTO hudi_table SELECT 1, 'a1', 20;
+```
+
+If the `write.operation` is 'upsert,' the INSERT INTO statement will not only 
insert new records but also update existing rows with the same record key.
+
+### Update Data

Review Comment:
   Change `Update Data` to `Update` for consistency?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396790547


##
website/docs/sql_dml.md:
##
@@ -199,7 +199,52 @@ You can control the behavior of these operations using 
various configuration opt
 
 ## Flink 
 
-Flink SQL also provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. All these operations are already 
-showcased in the [Flink Quickstart](/docs/flink-quick-start-guide).
+Flink SQL provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. These operations allow you to insert, update and 
delete data from your Hudi tables. Let's explore them one by one.
 
+### Insert Data
 
+You can utilize the INSERT INTO statement to incorporate data into a Hudi 
table using Flink SQL. Here are a few illustrative examples:
+
+```sql
+INSERT INTO <table> 
+SELECT <columns> FROM <source>;
+```
+
+Examples:
+
+```sql
+-- Insert into a Hudi table
+INSERT INTO hudi_table SELECT 1, 'a1', 20;
+```
+
+If the `write.operation` is 'upsert,' the INSERT INTO statement will not only 
insert new records but also update existing rows with the same record key.

Review Comment:
   Add an example of how to set this option?
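   
   For instance, a sketch of how it could be shown (the option goes in the table's `WITH` clause; schema reused from the earlier example):
   
   ```sql
   -- Illustrative: create the Flink table with upsert semantics for INSERT INTO.
   CREATE TABLE hudi_table(
     id INT PRIMARY KEY NOT ENFORCED,
     name STRING,
     price DOUBLE
   ) WITH (
     'connector' = 'hudi',
     'path' = 'file:///tmp/hudi_table',
     'write.operation' = 'upsert'  -- other values: 'insert', 'bulk_insert'
   );
   
   -- A later INSERT with the same key then updates the existing row.
   INSERT INTO hudi_table SELECT 1, 'a1', 40;
   ```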



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2023-11-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7117:
--
Epic Link: HUDI-512

> Functional index creation not working when table is created using datasource 
> writer
> ---
>
> Key: HUDI-7117
> URL: https://issues.apache.org/jira/browse/HUDI-7117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Aditya Goenka
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Details and Reproducible code under Github Issue - 
> [https://github.com/apache/hudi/issues/10110]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] RFC 63 Functional Index Hudi 0.1.0-beta [hudi]

2023-11-16 Thread via GitHub


ad1happy2go commented on issue #10110:
URL: https://github.com/apache/hudi/issues/10110#issuecomment-1815852059

   @soumilshah1995 @codope Created a JIRA to track this issue: https://issues.apache.org/jira/browse/HUDI-7117


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396788955


##
website/docs/sql_ddl.md:
##
@@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog
 The following is an example of creating a Flink table. Read the [Flink Quick 
Start](/docs/flink-quick-start-guide) guide for more examples.
 
 ```sql 
-CREATE TABLE hudi_table2(
+CREATE TABLE hudi_table(
   id int, 
   name string, 
   price double
 )
 WITH (
 'connector' = 'hudi',
-'path' = 's3://bucket-name/hudi/',
+'path' = 'file:///tmp/hudi_table',
 'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, default 
is COPY_ON_WRITE
 );
 ```
 
+### Create partitioned table
+
+The following is an example of creating a Flink partitioned table.
+
+```sql 
+CREATE TABLE hudi_table(
+  id BIGINT,
+  name STRING,
+  dt STRING,
+  hh STRING
+)
+PARTITIONED BY (`dt`)
+WITH (
+'connector' = 'hudi',
+'path' = 'file:///tmp/hudi_table',
+'table.type' = 'MERGE_ON_READ'
+);
+```
+
+### Create table with record keys and ordering fields

Review Comment:
   Also consider adding a section on how to set Hudi-specific configs in Flink, similar to the 'Setting Hudi configs' section on this page under Spark SQL.
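   
   As a sketch of what such a section could show (options illustrative): static Hudi options go in the `WITH` clause, and per-query overrides can use dynamic table option hints.
   
   ```sql
   -- Illustrative: Hudi-specific options passed at table creation time...
   CREATE TABLE hudi_table(
     id BIGINT PRIMARY KEY NOT ENFORCED,
     name STRING
   ) WITH (
     'connector' = 'hudi',
     'path' = 'file:///tmp/hudi_table',
     'table.type' = 'MERGE_ON_READ',
     'write.tasks' = '4'  -- example Hudi Flink write option
   );
   
   -- ...or overridden per query with a dynamic table options hint.
   SELECT * FROM hudi_table /*+ OPTIONS('read.streaming.enabled' = 'true') */;
   ```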



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7117) Functional index creation not working when table is created using datasource writer

2023-11-16 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-7117:
---

 Summary: Functional index creation not working when table is 
created using datasource writer
 Key: HUDI-7117
 URL: https://issues.apache.org/jira/browse/HUDI-7117
 Project: Apache Hudi
  Issue Type: Bug
  Components: index
Reporter: Aditya Goenka
 Fix For: 1.0.0


Details and Reproducible code under Github Issue - 
[https://github.com/apache/hudi/issues/10110]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396786024


##
website/docs/sql_dml.md:
##
@@ -199,7 +199,52 @@ You can control the behavior of these operations using 
various configuration opt
 
 ## Flink 
 
-Flink SQL also provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. All these operations are already 
-showcased in the [Flink Quickstart](/docs/flink-quick-start-guide).
+Flink SQL provides several Data Manipulation Language (DML) actions for 
interacting with Hudi tables. These operations allow you to insert, update and 
delete data from your Hudi tables. Let's explore them one by one.
 
+### Insert Data

Review Comment:
   `Insert Into` for consistent naming with the Spark SQL section?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10130:
URL: https://github.com/apache/hudi/pull/10130#issuecomment-1815846308

   
   ## CI report:
   
   * 42ed5726bd8cbfef2bed148ed6186034b11fd9eb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20970)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815846194

   
   ## CI report:
   
   * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964)
 
   * 29b2f98a40bd73d7cde23f8cd845fec1c5c22dd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20969)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396784871


##
website/docs/sql_ddl.md:
##
@@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog
 The following is an example of creating a Flink table. Read the [Flink Quick 
Start](/docs/flink-quick-start-guide) guide for more examples.
 
 ```sql 
-CREATE TABLE hudi_table2(
+CREATE TABLE hudi_table(
   id int, 
   name string, 
   price double
 )
 WITH (
 'connector' = 'hudi',
-'path' = 's3://bucket-name/hudi/',
+'path' = 'file:///tmp/hudi_table',
 'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, default 
is COPY_ON_WRITE
 );
 ```
 
+### Create partitioned table
+
+The following is an example of creating a Flink partitioned table.
+
+```sql 
+CREATE TABLE hudi_table(
+  id BIGINT,
+  name STRING,
+  dt STRING,
+  hh STRING
+)
+PARTITIONED BY (`dt`)
+WITH (
+'connector' = 'hudi',
+'path' = 'file:///tmp/hudi_table',
+'table.type' = 'MERGE_ON_READ'
+);
+```
+
+### Create table with record keys and ordering fields
+
+The following is an example of creating a Flink table with record key and 
ordering field similarly to spark.
+
+```sql 
+CREATE TABLE hudi_table(
+  id BIGINT PRIMARY KEY NOT ENFORCED,
+  name STRING,
+  dt STRING,
+  hh STRING
+)
+PARTITIONED BY (`dt`)
+WITH (
+'connector' = 'hudi',
+'path' = 'file:///tmp/hudi_table',
+'table.type' = 'MERGE_ON_READ',
+'precombine.field' = 'hh'
+);
+```
+
 ### Alter Table

Review Comment:
   Add syntax similar to the Spark SQL section?
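   
   For example, something like the following (a sketch; Flink SQL currently supports only a limited set of ALTER operations, such as renaming):
   
   ```sql
   -- Syntax
   ALTER TABLE tableIdentifier RENAME TO newTableIdentifier;
   
   -- Example
   ALTER TABLE hudi_table RENAME TO hudi_table_renamed;
   ```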



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396784495


##
website/docs/sql_ddl.md:
##
@@ -383,18 +383,57 @@ CREATE CATALOG hoodie_catalog
 The following is an example of creating a Flink table. Read the [Flink Quick 
Start](/docs/flink-quick-start-guide) guide for more examples.

Review Comment:
   Can we follow a similar pattern to the Spark SQL one, with the syntax first, followed by the examples? I believe there can be one syntax block for the create table section, followed by partitioned and non-partitioned sections.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10130:
URL: https://github.com/apache/hudi/pull/10130#issuecomment-1815839243

   
   ## CI report:
   
   * 42ed5726bd8cbfef2bed148ed6186034b11fd9eb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815839107

   
   ## CI report:
   
   * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964)
 
   * 29b2f98a40bd73d7cde23f8cd845fec1c5c22dd1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396777966


##
website/docs/flink-quick-start-guide.md:
##
@@ -215,9 +215,9 @@ HoodiePipeline.Builder builder = 
HoodiePipeline.builder(targetTable)
 
 builder.sink(dataStream, false); // The second parameter indicating whether 
the input data stream is bounded
 env.execute("Api_Sink");
-
-// Full Quickstart Example - 
https://gist.github.com/ad1happy2go/1716e2e8aef6dcfe620792d6e6d86d36
 ```
+Refer Full Quickstart Example 
[here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamWriter.java)

Review Comment:
   Can we change the table's schema to match the Flink SQL example?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396777280


##
website/docs/flink-quick-start-guide.md:
##
@@ -322,6 +359,20 @@ The `DELETE` statement is supported since Flink 1.17, so 
only Hudi Flink bundle
 Only **batch** queries on Hudi table with primary key work correctly.
 :::
 
+
+
+
+
+Creates a Flink Hudi table first and insert data into the Hudi table using 
DataStream API as below.
+When new rows with the same primary key and Row Kind as Delete arrive in the stream, they will be deleted.
+
+Refer Delete Example 
[here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamReader.java)

Review Comment:
   Are the examples complete yet ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396776833


##
website/docs/flink-quick-start-guide.md:
##
@@ -302,7 +331,15 @@ Only **batch** queries on Hudi table with primary key work 
correctly.
 :::
 
 ## Delete Data {#deletes}
+
 
+
 ### Row-level Delete

Review Comment:
   Markdown format nit: leave a blank line above and below to render the heading style.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396776055


##
website/docs/flink-quick-start-guide.md:
##
@@ -287,10 +289,37 @@ Refers to [Table types and 
queries](/docs/concepts#table-types--queries) for mor
 
 This is similar to inserting new data.
 
+
+
+
+
+Creates a Flink Hudi table first and insert data into the Hudi table using SQL 
`VALUES` as below.
+
 ```sql
+-- Update Queries only works with batch execution mode
 SET 'execution.runtime-mode' = 'batch';
 UPDATE hudi_table SET fare = 25.0 WHERE uuid = 
'334e26e9-8355-45cc-97c6-c31daf0df330';
 ```
+
+
+
+
+Creates a Flink Hudi table first and insert data into the Hudi table using 
DataStream API as below.
+When new rows with the same primary key arrive in the stream, they will be updated.
+In the insert example, an incoming row with the same record id will be updated.
+
+Refer Update Example 
[here](https://github.com/ad1happy2go/hudi-examples/blob/main/flink/src/main/java/com/hudi/flink/quickstart/HudiDataStreamReader.java)
+
+
+
+
 
 Notice that the save mode is now `Append`. In general, always use append mode 
unless you are trying to create the table for the first time.

Review Comment:
   It's not clear where the save mode is. Is this carried over from the Spark quickstart by any chance?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Flink quickstart and sql website updates [hudi]

2023-11-16 Thread via GitHub


bhasudha commented on code in PR #9952:
URL: https://github.com/apache/hudi/pull/9952#discussion_r1396770283


##
website/docs/flink-quick-start-guide.md:
##
@@ -287,10 +289,37 @@ Refers to [Table types and 
queries](/docs/concepts#table-types--queries) for mor
 
 This is similar to inserting new data.
 
+
+
+
+
+Creates a Flink Hudi table first and insert data into the Hudi table using SQL 
`VALUES` as below.
+
 ```sql
+-- Update Queries only works with batch execution mode
 SET 'execution.runtime-mode' = 'batch';
 UPDATE hudi_table SET fare = 25.0 WHERE uuid = 
'334e26e9-8355-45cc-97c6-c31daf0df330';
 ```
+
+
+
+
+Creates a Flink Hudi table first and insert data into the Hudi table using 
DataStream API as below.

Review Comment:
   Instead of "below", link to the insert section of this doc?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on PR #10126:
URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815827193

   Thanks for the help!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396767580


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext 
context, boolean acquireLoc
 };
 this.timelineWriter.write(instantsToArchive, Option.of(action -> 
deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler));
 LOG.info("Deleting archived instants " + instantsToArchive);
-success = deleteArchivedInstants(instantsToArchive, context);
+deleteArchivedInstants(instantsToArchive, context);
 // triggers compaction and cleaning only after archiving action
 this.timelineWriter.compactAndClean(context);
   } else {
 LOG.info("No Instants to archive");
   }
-  return success;
+  return instantsToArchive.size();

Review Comment:
   We can address it in another PR; we currently do not have atomicity across the 2 steps:
   
   - flush of the archived timeline
   - deletion of active metadata files
   
   My rough thought is that we can fix leftover active metadata files (those that should have been deleted from the active timeline) in the next round of archiving: imagine the latest instant time in the archived timeline is t10 and the oldest instant in the active timeline is t7; we should retry the deletion of instant metadata files from t7 ~ t10 at the very beginning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396767580


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext 
context, boolean acquireLoc
 };
 this.timelineWriter.write(instantsToArchive, Option.of(action -> 
deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler));
 LOG.info("Deleting archived instants " + instantsToArchive);
-success = deleteArchivedInstants(instantsToArchive, context);
+deleteArchivedInstants(instantsToArchive, context);
 // triggers compaction and cleaning only after archiving action
 this.timelineWriter.compactAndClean(context);
   } else {
 LOG.info("No Instants to archive");
   }
-  return success;
+  return instantsToArchive.size();

Review Comment:
   We can address it in another PR; we currently do not have atomicity across the 2 steps:
   
   - flush of the archived timeline
   - deletion of active metadata files
   
   My rough thought is that we can fix leftover active metadata files (those that should have been deleted from the active timeline) in the next round of archiving: imagine the latest instant time in the archived timeline is t10 and the oldest instant in the active timeline is t7; we should retry the deletion of instant metadata files from t7 ~ t10 at the very beginning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


yihua commented on PR #10126:
URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815823664

   The upload is done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396761804


##
.github/workflows/bot.yml:
##
@@ -377,15 +370,6 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark3.1'
-sparkRuntime: 'spark3.1.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   We have flink1.14.6 and spark 2.4.8 docker images now; we can just upgrade to flink1.14.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396761419


##
.github/workflows/bot.yml:
##
@@ -302,15 +301,9 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark3.1'
-sparkRuntime: 'spark3.1.3'
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.0'
 sparkRuntime: 'spark3.0.2'
-  - flinkProfile: 'flink1.13'
-sparkProfile: 'spark2.4'

Review Comment:
   We have flink1.14.6 and spark 2.4.8 docker images now; we can just upgrade to flink1.14.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


danny0405 merged PR #10126:
URL: https://github.com/apache/hudi/pull/10126


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7111] Fix performance regression of tag when written into simple bucket index table [hudi]

2023-11-16 Thread via GitHub


beyond1920 opened a new pull request, #10130:
URL: https://github.com/apache/hudi/pull/10130

   ### Change Logs
   
   After upgrading to version 0.14.0, the performance of a Spark job that writes into a simple bucket index table regressed.
   
![image](https://github.com/apache/hudi/assets/1525333/1652062b-3a96-4ed1-943a-9040ea7737ce)
   
   The reason is that in [PR#4480](https://github.com/apache/hudi/pull/4480), the refactoring of the bucket index introduced two unnecessary stages during tagging for the simple bucket index:
   `List<String> partitions = records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();`
   
   ### Impact
   
   NA
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7111) Performance regression of spark job which written into simple bucket index table

2023-11-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7111:
-
Labels: pull-request-available  (was: )

> Performance regression of spark job which written into simple bucket index 
> table
> 
>
> Key: HUDI-7111
> URL: https://issues.apache.org/jira/browse/HUDI-7111
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-11-16-23-41-32-729.png
>
>
> After upgrading to version 0.14.0, the performance of a Spark job that writes
> into a simple bucket index table regressed.
>  !image-2023-11-16-23-41-32-729.png! 
> The reason is that in [PR#4480|https://github.com/apache/hudi/pull/4480], the
> refactoring of the bucket index introduced two unnecessary stages during tagging
> for the simple bucket index.
> {code:java}
> List<String> partitions = 
> records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


yihua commented on PR #10126:
URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815805457

   The docker image 
`apachehudi/hudi-ci-bundle-validation-base:flink1146hive239spark248` is being 
uploaded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


yihua commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396743948


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/HoodieBigQuerySyncClient.java:
##
@@ -51,34 +51,41 @@
 import java.util.Map;
 import java.util.stream.Collectors;
 
+import static 
org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID;
 import static 
org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_DATASET_LOCATION;
 import static 
org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_DATASET_NAME;
 import static 
org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_PROJECT_ID;
+import static 
org.apache.hudi.gcp.bigquery.BigQuerySyncConfig.BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER;
 
 public class HoodieBigQuerySyncClient extends HoodieSyncClient {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(HoodieBigQuerySyncClient.class);
 
   protected final BigQuerySyncConfig config;
   private final String projectId;
+  private final String bigLakeConnectionId;
   private final String datasetName;
+  private final boolean requirePartitionFilter;
   private transient BigQuery bigquery;
 
   public HoodieBigQuerySyncClient(final BigQuerySyncConfig config) {
 super(config);
 this.config = config;
 this.projectId = config.getString(BIGQUERY_SYNC_PROJECT_ID);
+this.bigLakeConnectionId = 
config.getString(BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID);
 this.datasetName = config.getString(BIGQUERY_SYNC_DATASET_NAME);
+this.requirePartitionFilter = 
config.getBoolean(BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER);
 this.createBigQueryConnection();
   }
 
-  @VisibleForTesting

Review Comment:
   Not sure if this needs to be kept. I'll leave it to you to decide.



##
hudi-gcp/pom.xml:
##
@@ -70,7 +70,6 @@ See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
 
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-pubsub</artifactId>
-  <version>${google.cloud.pubsub.version}</version>

Review Comment:
   Got it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


yihua commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396743251


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig 
implements Serializable
   .markAdvanced()
   .withDocumentation("Fetch file listing from Hudi's metadata");
 
  public static final ConfigProperty<Boolean> 
BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
+  .key("hoodie.gcp.bigquery.sync.require_partition_filter")
+  .defaultValue(false)
+  .withDocumentation("If true, configure table to require a partition 
filter to be specified when querying the table");
+
  public static final ConfigProperty<String> 
BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty
+  .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id")
+  .noDefaultValue()
+  .withDocumentation("The Big Lake connection ID to use");
+

Review Comment:
   yeah
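   For readers following along, a minimal sketch of how these two options could be
supplied; only the property keys and the `BigQuerySyncTool(Properties)` constructor
come from this PR, while the values are placeholders and all other required sync
options (project id, dataset, base path, ...) are omitted:

{code:java}
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.gcp.bigquery.BigQuerySyncTool;

public class BigQuerySyncOptionsSketch {
  public static void main(String[] args) {
    TypedProperties props = new TypedProperties();
    // Require a partition filter on queries against the synced table.
    props.setProperty("hoodie.gcp.bigquery.sync.require_partition_filter", "true");
    // Placeholder BigLake connection id; substitute your own.
    props.setProperty("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id",
        "my-project.us.my-biglake-connection");
    // The remaining required BigQuery sync options must also be set before this will run.
    BigQuerySyncTool syncTool = new BigQuerySyncTool(props);
    syncTool.syncHoodieTable();
  }
}
{code}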



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815793435

   
   ## CI report:
   
   * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722)
 
   * 47200120dd37ebaee77c583628bfddac1f564b3b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20967)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815788365

   
   ## CI report:
   
   * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20965)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1815788149

   
   ## CI report:
   
   * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722)
 
   * 47200120dd37ebaee77c583628bfddac1f564b3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10126:
URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815754278

   
   ## CI report:
   
   * 35fa17181c858c202869a4f9f7807ccb37c83438 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] DOCS-updated community sync page for new details [hudi]

2023-11-16 Thread via GitHub


nfarah86 opened a new pull request, #10129:
URL: https://github.com/apache/hudi/pull/10129

   
   @bhasudha  updated the community sync page to point to the LinkedIn live 
events page
   
   https://github.com/apache/hudi/assets/5392555/0eb42563-38f1-4fe4-98a8-a83d0ec2d2ab
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10126:
URL: https://github.com/apache/hudi/pull/10126#issuecomment-1815749391

   
   ## CI report:
   
   * 35fa17181c858c202869a4f9f7807ccb37c83438 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Hudi Sink Connector is not working with Version 0.14.0 and 0.13.1 [hudi]

2023-11-16 Thread via GitHub


seethb opened a new issue, #10128:
URL: https://github.com/apache/hudi/issues/10128

   Hi, I have followed the Hudi Kafka Connect instructions from this document 
https://github.com/apache/hudi/blob/master/hudi-kafka-connect/README.md and am 
trying to set up a Hudi Sink connector in my local environment. 
   
   After submitting the connector config (shown below), I am getting the error below.
   curl -i -X POST -H "Accept:application/json" -H 
"Content-Type:application/json" localhost:8083/connectors/ -d '{
   "name": "hudi-kafka-test",
   "config": {
   "bootstrap.servers": "172.18.0.5:9092",
   "connector.class": 
"org.apache.hudi.connect.HoodieSinkConnector",
   "tasks.max": "1",
   "topics": "hudi-test-topic",
   "hoodie.table.name": "hudi-test-topic",
   "key.converter": 
"org.apache.kafka.connect.storage.StringConverter",
   "value.converter": 
"org.apache.kafka.connect.storage.StringConverter",
   "value.converter.schemas.enable": "false",
   "hoodie.table.type": "MERGE_ON_READ",
   "hoodie.base.path": "file:///tmp/hoodie/hudi-test-topic",
   "hoodie.datasource.write.recordkey.field": "volume",
   "hoodie.datasource.write.partitionpath.field": "date",
   "hoodie.schemaprovider.class": 
"org.apache.hudi.schema.SchemaRegistryProvider",
   "hoodie.deltastreamer.schemaprovider.registry.url": 
"http://172.18.0.7:8081/subjects/hudi-test-topic-value/versions/latest;,
   "hoodie.kafka.commit.interval.secs": 60
 }
   }'
   
   SubscriptionState:399)
   [2023-11-15 11:05:12,065] INFO [hudi-kafka-test|task-0] New partitions added 
[hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:150)
   [2023-11-15 11:05:12,065] INFO [hudi-kafka-test|task-0] Bootstrap task for 
connector hudi-kafka-test with id null with assignments [hudi-test-topic-0] 
part [hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:185)
   [2023-11-15 11:05:12,067] INFO [hudi-kafka-test|task-0] Existing partitions 
deleted [hudi-test-topic-0] (org.apache.hudi.connect.HoodieSinkTask:156)
   [2023-11-15 11:05:12,067] ERROR [hudi-kafka-test|task-0] 
WorkerSinkTask{id=hudi-kafka-test-0} Task threw an uncaught and unrecoverable 
exception. Task is being killed and will not recover until manually restarted 
(org.apache.kafka.connect.runtime.WorkerTask:195)
   java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at 
org.apache.hudi.config.metrics.HoodieMetricsConfig.lambda$static$0(HoodieMetricsConfig.java:72)
at 
org.apache.hudi.common.config.HoodieConfig.setDefaultValue(HoodieConfig.java:86)
at 
org.apache.hudi.common.config.HoodieConfig.lambda$setDefaults$2(HoodieConfig.java:139)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at 
org.apache.hudi.common.config.HoodieConfig.setDefaults(HoodieConfig.java:135)
at 
org.apache.hudi.config.metrics.HoodieMetricsConfig.access$100(HoodieMetricsConfig.java:44)
at 
org.apache.hudi.config.metrics.HoodieMetricsConfig$Builder.build(HoodieMetricsConfig.java:185)
at 
org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:2937)
at 
org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3052)
at 
org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3047)
at 
org.apache.hudi.connect.writers.KafkaConnectTransactionServices.(KafkaConnectTransactionServices.java:79)
at 
org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.(ConnectTransactionCoordinator.java:88)
at 
org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:637)
at 
org.apache.kafka.connect.runtime.WorkerSinkTask.access$1200(WorkerSinkTask.java:72)
at 
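   A minimal, Hudi-agnostic check for the `NoClassDefFoundError` above:
`org.apache.hadoop.fs.FSDataInputStream` ships in hadoop-common, so the first
question is whether that jar is visible to the Connect worker JVM. Everything in
this sketch beyond the class name is an assumption:

{code:java}
public class HadoopClasspathCheck {
  public static void main(String[] args) {
    try {
      // Same class the stack trace above fails to load.
      Class.forName("org.apache.hadoop.fs.FSDataInputStream");
      System.out.println("hadoop-common is on the classpath");
    } catch (ClassNotFoundException e) {
      System.out.println("hadoop-common is missing from the Connect worker classpath");
    }
  }
}
{code}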

[I] [SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table [hudi]

2023-11-16 Thread via GitHub


shubhamn21 opened a new issue, #10127:
URL: https://github.com/apache/hudi/issues/10127

   **Describe the problem you faced**
   
   The Hudi job runs fine for an hour but then crashes after a warning about a 
`Clean Action failure`, subsequently raising the exception 
   `org.apache.hudi.exception.HoodieIOException: Could not check if 
s3a://xyz-bucket/spark-warehouse/xyz-table-name/.hoodie/metadata is a valid 
table`
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version : 3.3.0
   
   * Java version : 1.8 
   
   * Storage (HDFS/S3/GCS..) : S3A bucket
   
   * Running on Docker? (yes/no) : Yes, Spark on kubernetes
   
   
   **Stacktrace**
   
   ```23/11/17 04:00:55 INFO HoodieTableMetaClient: Loading 
HoodieTableMetaClient from 
s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata
   23/11/17 04:00:55 WARN CleanActionExecutor: Failed to perform previous clean 
operation, instant: [==>20231117040045263__clean__REQUESTED]
   org.apache.hudi.exception.HoodieIOException: Could not check if 
s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata is a valid 
table
at 
org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59)
at 
org.apache.hudi.common.table.HoodieTableMetaClient.(HoodieTableMetaClient.java:137)
at 
org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
at 
org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
at 
org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
at 
org.apache.hudi.common.table.HoodieTableMetaClient.reload(HoodieTableMetaClient.java:174)
at 
org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:153)
at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:838)
at 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:918)
at 
org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$1(BaseActionExecutor.java:68)
at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
at 
org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:68)
at 
org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:221)
at 
org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:187)
at 
org.apache.hudi.table.action.clean.CleanActionExecutor.lambda$execute$8(CleanActionExecutor.java:256)
at java.util.ArrayList.forEach(ArrayList.java:1259)
at 
org.apache.hudi.table.action.clean.CleanActionExecutor.execute(CleanActionExecutor.java:250)
at 
org.apache.hudi.table.HoodieSparkCopyOnWriteTable.clean(HoodieSparkCopyOnWriteTable.java:263)
at 
org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:557)
at 
org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:759)
at 
org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:731)
at 
org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.InterruptedIOException: getFileStatus on 
s3a://xyz-bucket/spark-warehouse/xyz_table_name/.hoodie/metadata/.hoodie: 
com.amazonaws.AbortedException: 
at 
org.apache.hadoop.fs.s3a.S3AUtils.translateInterruptedException(S3AUtils.java:395)
at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:201)
at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3799)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556)
at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)
at 

[jira] [Updated] (HUDI-7116) Add docker image for flink 1.14 and spark 2.4.8

2023-11-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7116:
-
Labels: pull-request-available  (was: )

> Add docker image for flink 1.14 and spark 2.4.8
> ---
>
> Key: HUDI-7116
> URL: https://issues.apache.org/jira/browse/HUDI-7116
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compile
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7116] Add docker image for flink 1.14 and spark 2.4.8 [hudi]

2023-11-16 Thread via GitHub


danny0405 opened a new pull request, #10126:
URL: https://github.com/apache/hudi/pull/10126

   ### Change Logs
   
   Add the build image for flink 1.14 and spark 2.4.8.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7116) Add docker image for flink 1.14 and spark 2.4.8

2023-11-16 Thread Danny Chen (Jira)
Danny Chen created HUDI-7116:


 Summary: Add docker image for flink 1.14 and spark 2.4.8
 Key: HUDI-7116
 URL: https://issues.apache.org/jira/browse/HUDI-7116
 Project: Apache Hudi
  Issue Type: Improvement
  Components: compile
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396685712


##
.github/workflows/bot.yml:
##
@@ -377,15 +370,6 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade 
to flink1.14.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7072] Remove support for Flink 1.13 [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10052:
URL: https://github.com/apache/hudi/pull/10052#discussion_r1396685357


##
.github/workflows/bot.yml:
##
@@ -302,15 +301,9 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade 
the flink profile to flink1.14.



##
.github/workflows/bot.yml:
##
@@ -302,15 +301,9 @@ jobs:
   - flinkProfile: 'flink1.14'
 sparkProfile: 'spark3.2'
 sparkRuntime: 'spark3.2.3'
-  - flinkProfile: 'flink1.13'

Review Comment:
   We have a flink1.14.6 and spark 3.1.3 docker image now, so we can just upgrade 
to flink1.14.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815718739

   
   ## CI report:
   
   * 1877ef99bb4c939181b5341e19c44cbba742d7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20957)
 
   * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20965)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on issue #10107:
URL: https://github.com/apache/hudi/issues/10107#issuecomment-1815714178

   We did have some fixes in recent releases: 
https://github.com/apache/hudi/pull/7568, 
https://github.com/apache/hudi/pull/8443.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10122:
URL: https://github.com/apache/hudi/pull/10122#issuecomment-1815713303

   
   ## CI report:
   
   * 1877ef99bb4c939181b5341e19c44cbba742d7cd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20957)
 
   * faf61fb4c40584fd9dbdd4aafc85e699c3d9d8ba UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Build failed using master [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on PR #9726:
URL: https://github.com/apache/hudi/pull/9726#issuecomment-1815711729

   If we are talking about the version conflicts between the explicitly introduced 
jackson jar and the parquet jar, it is actually a bug and should have a fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7112] Reuse existing timeline server and performance improvements [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10122:
URL: https://github.com/apache/hudi/pull/10122#discussion_r1396674754


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineServerHelper.java:
##
@@ -23,66 +23,42 @@
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.config.HoodieWriteConfig;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
 import java.io.IOException;
 
 /**
  * Helper class to instantiate embedded timeline service.
  */
 public class EmbeddedTimelineServerHelper {
 
-  private static final Logger LOG = 
LoggerFactory.getLogger(EmbeddedTimelineService.class);
-
-  private static Option<EmbeddedTimelineService> TIMELINE_SERVER = 
Option.empty();
-
   /**
* Instantiate Embedded Timeline Server.
* @param context Hoodie Engine Context
* @param config Hoodie Write Config
* @return TimelineServer if configured to run
* @throws IOException
*/
-  public static synchronized Option<EmbeddedTimelineService> 
createEmbeddedTimelineService(
+  public static Option<EmbeddedTimelineService> createEmbeddedTimelineService(
   HoodieEngineContext context, HoodieWriteConfig config) throws 
IOException {
-if (config.isEmbeddedTimelineServerReuseEnabled()) {
-  if (!TIMELINE_SERVER.isPresent() || 
!TIMELINE_SERVER.get().canReuseFor(config.getBasePath())) {
-TIMELINE_SERVER = Option.of(startTimelineService(context, config));
-  } else {
-updateWriteConfigWithTimelineServer(TIMELINE_SERVER.get(), config);
-  }
-  return TIMELINE_SERVER;
-}
 if (config.isEmbeddedTimelineServerEnabled()) {
-  return Option.of(startTimelineService(context, config));
+  Option<String> hostAddr = 
context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
+  EmbeddedTimelineService timelineService = 
EmbeddedTimelineService.getOrStartEmbeddedTimelineService(context, 
hostAddr.orElse(null), config);
+  updateWriteConfigWithTimelineServer(timelineService, config);

Review Comment:
   So a single Spark driver has a chance to be shared by multiple writers?
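   To make the question concrete, a hedged sketch of that scenario: two write
clients created in the same driver JVM. `SparkRDDWriteClient` and
`HoodieWriteConfig` are standard Hudi API; whether the two clients end up sharing
one embedded timeline server is exactly what `getOrStartEmbeddedTimelineService`
in this PR decides:

{code:java}
import org.apache.hudi.client.SparkRDDWriteClient;
import org.apache.hudi.client.common.HoodieSparkEngineContext;
import org.apache.hudi.config.HoodieWriteConfig;

class TwoWritersOneDriver {
  static void openWriters(HoodieSparkEngineContext context,
                          HoodieWriteConfig cfgTableA,
                          HoodieWriteConfig cfgTableB) {
    // Both clients live in the same driver process.
    SparkRDDWriteClient<?> writerA = new SparkRDDWriteClient<>(context, cfgTableA);
    SparkRDDWriteClient<?> writerB = new SparkRDDWriteClient<>(context, cfgTableB);
    // ... writes against the two tables would happen here ...
    // With this change, each base path resolves its embedded timeline service through
    // getOrStartEmbeddedTimelineService rather than a single static TIMELINE_SERVER.
    writerA.close();
    writerB.close();
  }
}
{code}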



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815708138

   
   ## CI report:
   
   * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


majian1998 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396670883


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext 
context, boolean acquireLoc
 };
 this.timelineWriter.write(instantsToArchive, Option.of(action -> 
deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler));
 LOG.info("Deleting archived instants " + instantsToArchive);
-success = deleteArchivedInstants(instantsToArchive, context);
+deleteArchivedInstants(instantsToArchive, context);
 // triggers compaction and cleaning only after archiving action
 this.timelineWriter.compactAndClean(context);
   } else {
 LOG.info("No Instants to archive");
   }
-  return success;
+  return instantsToArchive.size();

Review Comment:
   In the current archive code, it seems that `success` is always set to true. 
The `success` variable is initialized as true, and `deleteArchivedInstants` 
always returns true unless it fails. However, if there is a failure, the 
current implementation should throw an exception and terminate without 
returning any value. Therefore, I believe the `success` variable is meaningless 
in the current context. Alternatively, should we catch these exceptions and 
return false? I'm not sure if this would be reasonable.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396663741


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -111,13 +114,13 @@ public boolean archiveIfRequired(HoodieEngineContext 
context, boolean acquireLoc
 };
 this.timelineWriter.write(instantsToArchive, Option.of(action -> 
deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler));
 LOG.info("Deleting archived instants " + instantsToArchive);
-success = deleteArchivedInstants(instantsToArchive, context);
+deleteArchivedInstants(instantsToArchive, context);
 // triggers compaction and cleaning only after archiving action
 this.timelineWriter.compactAndClean(context);
   } else {
 LOG.info("No Instants to archive");
   }
-  return success;
+  return instantsToArchive.size();

Review Comment:
   Should we also include the `success` flag as a metric?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815676361

   
   ## CI report:
   
   * 178ef4eadac6ab6d009d86ab86d35babe952 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942)
 
   * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20964)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Query failure due to replacecommit being archived [hudi]

2023-11-16 Thread via GitHub


haoxie-aws commented on issue #10107:
URL: https://github.com/apache/hudi/issues/10107#issuecomment-1815675385

   I find that this query, `select * from samplehudi as t1, samplehudi as t2, 
samplehudi as t3 where t1.key=t2.key and t1.key=t3.key limit 1`, makes it 
slightly easier to replicate the issue. And I have managed to replicate it with 
a Hudi 0.12.2 writer, as recommended by @ad1happy2go. 
   
   I know Hudi uses `CLEAN_RETAIN_COMMITS` to prevent parquet files from being 
cleaned while they are used by a query, but is there a similar mechanism for 
timeline files?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Build failed using master [hudi]

2023-11-16 Thread via GitHub


Forus0322 commented on PR #9726:
URL: https://github.com/apache/hudi/pull/9726#issuecomment-1815673222

   @codope Hi, I reanalyzed the problem. The essence is that parquet 1.10.1 
contains the org.codehaus.jackson dependency, while parquet 1.12.0 does not.
   
   Secondly, the JsonEncoder class under the hudi-common-base package uses the 
org.codehaus.jackson dependency, and hudi-common-base references the 
parquet-avro dependency, so compiling with parquet.version 1.10.1 works fine; 
but when using parquet.version 1.12.0, there is a missing JsonEncoder 
dependency problem.
   
   Therefore, this issue does not need to be addressed in the current PR; just 
specify parquet.version 1.10.1.
   
   But if you want to use parquet.version 1.12.0, you need to add an explicit 
org.codehaus.jackson dependency.
   
   The current PR could instead modify the profiles so that the 
org.codehaus.jackson dependency is added to every profile that uses parquet 
1.12.0, not just hadoop3.3.
   
   Do you think it is necessary to modify it again?
   
   
![image](https://github.com/apache/hudi/assets/70357858/c884d3fa-f1dc-401e-bd8d-5a8d2cd30698)
   
![image](https://github.com/apache/hudi/assets/70357858/c83666a6-0be5-45e0-a9cf-b034338791ef)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


stream2000 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396632832


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiver.java:
##
@@ -256,8 +256,8 @@ public void testArchiveEmptyTable() throws Exception {
 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieTable table = HoodieSparkTable.create(cfg, context, metaClient);
 HoodieTimelineArchiver archiver = new HoodieTimelineArchiver(cfg, table);
-boolean result = archiver.archiveIfRequired(context);
-assertTrue(result);
+int result = archiver.archiveIfRequired(context);

Review Comment:
   We can add some tests for the archive metrics, like what we did for compaction 
metrics in #8759.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815670631

   
   ## CI report:
   
   * 178ef4eadac6ab6d009d86ab86d35babe952 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20942)
 
   * 3a4f9e92cd71dac2506c883c57785f07ee5bcf24 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815666263

   
   ## CI report:
   
   * d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN
   * f2f380dec7f0afa5fd7fb0accbe8c17e22853f00 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20963)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes [hudi]

2023-11-16 Thread via GitHub


zyclove commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-1815661842

   I also encountered the same problem with 0.14.0. How can it be solved?
   Disable the metadata table?
   set hoodie.metadata.table=false;
   
   Change hoodie.parquet.small.file.limit?
   
   Set hoodie.bloom.index.prune.by.ranges = false?
   
   Change hoodie.memory.merge.max.size?
   
   Can this be optimized in Hudi 1.0? This stage is simply too time-consuming.
   
   
![image](https://github.com/apache/hudi/assets/15028279/090b9a46-94fc-4d8c-9ea8-041437632761)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


majian1998 commented on PR #10101:
URL: https://github.com/apache/hudi/pull/10101#issuecomment-1815653571

   Now, the archive metrics have been moved to `BaseHoodieTableServiceClient`, 
and the return value of archiveIfRequired has been modified. In the previous 
implementation, the `success` variable always seemed to be true, so I have 
also removed the relevant checks in the unit tests. cc @danny0405 
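   A minimal sketch of that flow, assuming the shape described above:
`archiveIfRequired` reports the number of archived instants and the caller times
the call. The `Archiver` and `Reporter` interfaces below are placeholders for
illustration, not the actual Hudi classes:

{code:java}
import java.io.IOException;

class ArchiveMetricsSketch {
  // Stand-in for HoodieTimelineArchiver#archiveIfRequired, which now returns a count.
  interface Archiver {
    int archiveIfRequired() throws IOException;
  }

  // Stand-in for however BaseHoodieTableServiceClient hands the values to the metrics sink.
  interface Reporter {
    void reportArchive(long durationMs, int archivedInstants);
  }

  static void archiveWithMetrics(Archiver archiver, Reporter reporter) throws IOException {
    long start = System.currentTimeMillis();
    int archivedInstants = archiver.archiveIfRequired();
    reporter.reportArchive(System.currentTimeMillis() - start, archivedInstants);
  }
}
{code}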


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Disproportionately Slow performance during "Building workload profile" phase [hudi]

2023-11-16 Thread via GitHub


zyclove commented on issue #8189:
URL: https://github.com/apache/hudi/issues/8189#issuecomment-1815649576

   
![image](https://github.com/apache/hudi/assets/15028279/49829897-5dbc-4a98-bde9-2ded448681a2)
   
   
![image](https://github.com/apache/hudi/assets/15028279/8753d412-d06b-475a-a106-823e785fd96b)
   
   What do these two stages do? The job keeps getting stuck there, which is very 
time-consuming. Can you give me some optimization suggestions? Thank you.
   
   hudi 0.14.0 
   spark 3.2.1
   @nsivabalan @codope 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7110] Add call procedure for show column stats information [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10120:
URL: https://github.com/apache/hudi/pull/10120#discussion_r1396587932


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowMetadataTableColumnStatsProcedure.scala:
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hudi.avro.model.{BooleanWrapper, BytesWrapper, DateWrapper, 
DecimalWrapper, DoubleWrapper, FloatWrapper, HoodieMetadataColumnStats, 
IntWrapper, LongWrapper, StringWrapper, TimeMicrosWrapper, 
TimestampMicrosWrapper}
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.Supplier
+
+
+class ShowMetadataTableColumnStatsProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField("file_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("column_name", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("min_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("max_value", DataTypes.StringType, nullable = true, 
Metadata.empty),
+StructField("null_num", DataTypes.LongType, nullable = true, 
Metadata.empty)
+  ))
+
+  def parameters: Array[ProcedureParameter] = PARAMETERS
+
+  def outputType: StructType = OUTPUT_TYPE
+
+  override def call(args: ProcedureArgs): Seq[Row] = {
+super.checkArgs(PARAMETERS, args)
+
+val table = getArgValueOrDefault(args, PARAMETERS(0))
+val targetColumns = getArgValueOrDefault(args, 
PARAMETERS(1)).getOrElse("").toString
+val targetColumnsSeq = targetColumns.split(",").toSeq
+val basePath = getBasePath(table)
+val metadataConfig = HoodieMetadataConfig.newBuilder
+  .enable(true)
+  .build
+val metaClient = 
HoodieTableMetaClient.builder.setConf(jsc.hadoopConfiguration()).setBasePath(basePath).build
+val schemaUtil = new TableSchemaResolver(metaClient)
+var schema = 
AvroConversionUtils.convertAvroSchemaToStructType(schemaUtil.getTableAvroSchema)
+val columnStatsIndex = new ColumnStatsIndexSupport(spark, schema, 
metadataConfig, metaClient)
+val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = 
columnStatsIndex.loadColumnStatsIndexRecords(targetColumnsSeq, false)
+
+val rows = new util.ArrayList[Row]
+colStatsRecords.collectAsList()
+  .stream()
+  .forEach(c => {
+rows.add(Row(c.getFileName, c.getColumnName, 
getColumnStatsValue(c.getMinValue), getColumnStatsValue(c.getMaxValue), 
c.getNullCount.longValue()))
+  })
+rows.stream().toArray().map(r => r.asInstanceOf[Row]).toList
+  }
+
+  def getColumnStatsValue(stats_value: Any): String = {
+stats_value match {
+  case _: IntWrapper |
+   _: BooleanWrapper |
+   _: BytesWrapper |

Review Comment:
   `java.nio.HeapByteBuffer[pos=0 lim=11 cap=11]` does not look very readable
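   A small, self-contained illustration of the readability concern (nothing
Hudi-specific): printing a raw `ByteBuffer` only shows its position/limit/capacity,
while decoding it shows the actual value, which is roughly what
`getColumnStatsValue` would need to do for `BytesWrapper` stats:

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferToStringDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap("hello-hudi!".getBytes(StandardCharsets.UTF_8));
    // Prints something like: java.nio.HeapByteBuffer[pos=0 lim=11 cap=11]
    System.out.println(buf);
    // Prints the decoded value: hello-hudi!
    System.out.println(StandardCharsets.UTF_8.decode(buf.duplicate()));
  }
}
{code}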



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815627236

   
   ## CI report:
   
   * 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)
 
   * d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN
   * f2f380dec7f0afa5fd7fb0accbe8c17e22853f00 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815621023

   
   ## CI report:
   
   * 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)
 
   * d94d74a02df88f3ca32807c7f580900b268ca0d0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7109) Fix Flink may re-use a committed instant in append mode

2023-11-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7109:
-
Fix Version/s: 0.14.1
   1.0.0

> Fix Flink may re-use a committed instant in append mode
> ---
>
> Key: HUDI-7109
> URL: https://issues.apache.org/jira/browse/HUDI-7109
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7109) Fix Flink may re-use a committed instant in append mode

2023-11-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7109.

Resolution: Fixed

Fixed via master branch: 3d0c4501d6b7062f38b7755f71660818cd95c1f6

> Fix Flink may re-use a committed instant in append mode
> ---
>
> Key: HUDI-7109
> URL: https://issues.apache.org/jira/browse/HUDI-7109
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated (f06ff5b3e0e -> 3d0c4501d6b)

2023-11-16 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from f06ff5b3e0e [HUDI-7090] Set the maxParallelism for singleton operator  
(#10090)
 add 3d0c4501d6b [HUDI-7109] Fix Flink may re-use a committed instant in 
append mode (#10119)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/sink/append/AppendWriteFunction.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [PR] [HUDI-7109] Fix Flink may re-use a committed instant in append mode [hudi]

2023-11-16 Thread via GitHub


danny0405 merged PR #10119:
URL: https://github.com/apache/hudi/pull/10119


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7090) Set maxParallelism for singleton operator ,for example compact_plan_generate、split_monitor、compact_commit

2023-11-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7090.

Resolution: Fixed

Fixed via master branch: f06ff5b3e0ee8bb6e49aad04d3b6054d6c46e272

> Set maxParallelism for singleton operator ,for example  
> compact_plan_generate、split_monitor、compact_commit 
> ---
>
> Key: HUDI-7090
> URL: https://issues.apache.org/jira/browse/HUDI-7090
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>
> The parallelism of a singleton operator should be strictly controlled, which 
> avoids problems caused by modifying the parallelism through other means.
> For example:
> if the split_monitor operator parallelism is set to 10, it will read the data 
> 10 times.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7090) Set maxParallelism for singleton operator ,for example compact_plan_generate、split_monitor、compact_commit

2023-11-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7090:
-
Fix Version/s: 0.14.1
   1.0.0

> Set maxParallelism for singleton operator ,for example  
> compact_plan_generate、split_monitor、compact_commit 
> ---
>
> Key: HUDI-7090
> URL: https://issues.apache.org/jira/browse/HUDI-7090
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: hehuiyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1, 1.0.0
>
>
> The parallelism of a singleton operator should be strictly controlled, which 
> avoids problems caused by modifying the parallelism through other means.
> For example:
> if the split_monitor operator parallelism is set to 10, it will read the data 
> 10 times.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7090] Set the maxParallelism for singleton operator (#10090)

2023-11-16 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f06ff5b3e0e [HUDI-7090] Set the maxParallelism for singleton operator  
(#10090)
f06ff5b3e0e is described below

commit f06ff5b3e0ee8bb6e49aad04d3b6054d6c46e272
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Fri Nov 17 09:43:21 2023 +0800

[HUDI-7090] Set the maxParallelism for singleton operator  (#10090)
---
 .../hudi/sink/clustering/HoodieFlinkClusteringJob.java |  4 +++-
 .../org/apache/hudi/sink/compact/HoodieFlinkCompactor.java |  4 +++-
 .../main/java/org/apache/hudi/sink/utils/Pipelines.java| 14 +++---
 .../main/java/org/apache/hudi/table/HoodieTableSource.java |  1 +
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
index a453cac6803..0966f6995bd 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java
@@ -348,7 +348,9 @@ public class HoodieFlinkClusteringJob {
   .addSink(new ClusteringCommitSink(conf))
   .name("clustering_commit")
   .uid("uid_clustering_commit")
-  .setParallelism(1);
+  .setParallelism(1)
+  .getTransformation()
+  .setMaxParallelism(1);
 
   env.execute("flink_hudi_clustering_" + clusteringInstant.getTimestamp());
 }
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
index 57e823ab21c..99dd45d94b4 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/HoodieFlinkCompactor.java
@@ -298,7 +298,9 @@ public class HoodieFlinkCompactor {
   .addSink(new CompactionCommitSink(conf))
   .name("compaction_commit")
   .uid("uid_compaction_commit")
-  .setParallelism(1);
+  .setParallelism(1)
+  .getTransformation()
+  .setMaxParallelism(1);
 
   env.execute("flink_hudi_compaction_" + String.join(",", 
compactionInstantTimes));
 }
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
index e66009aa551..b3acd4cfa11 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
@@ -410,10 +410,11 @@ public class Pipelines {
* @return the compaction pipeline
*/
   public static DataStreamSink compact(Configuration 
conf, DataStream dataStream) {
-return dataStream.transform("compact_plan_generate",
+DataStreamSink compactionCommitEventDataStream = 
dataStream.transform("compact_plan_generate",
 TypeInformation.of(CompactionPlanEvent.class),
 new CompactionPlanOperator(conf))
 .setParallelism(1) // plan generate must be singleton
+.setMaxParallelism(1)
 // make the distribution strategy deterministic to avoid concurrent 
modifications
 // on the same bucket files
 .keyBy(plan -> plan.getOperation().getFileGroupId().getFileId())
@@ -424,6 +425,8 @@ public class Pipelines {
 .addSink(new CompactionCommitSink(conf))
 .name("compact_commit")
 .setParallelism(1); // compaction commit should be singleton
+compactionCommitEventDataStream.getTransformation().setMaxParallelism(1);
+return compactionCommitEventDataStream;
   }
 
   /**
@@ -452,6 +455,7 @@ public class Pipelines {
 TypeInformation.of(ClusteringPlanEvent.class),
 new ClusteringPlanOperator(conf))
 .setParallelism(1) // plan generate must be singleton
+.setMaxParallelism(1) // plan generate must be singleton
 .keyBy(plan ->
 // make the distribution strategy deterministic to avoid 
concurrent modifications
 // on the same bucket files
@@ -465,15 +469,19 @@ public class Pipelines {
   ExecNodeUtil.setManagedMemoryWeight(clusteringStream.getTransformation(),
   conf.getInteger(FlinkOptions.WRITE_SORT_MEMORY) * 1024L * 1024L);
 }
-return clusteringStream.addSink(new ClusteringCommitSink(conf))
+DataStreamSink clusteringCommitEventDataStream = 

Re: [PR] [HUDI-7090]Set the maxParallelism for singleton operator [hudi]

2023-11-16 Thread via GitHub


danny0405 merged PR #10090:
URL: https://github.com/apache/hudi/pull/10090





Re: [I] [SUPPORT]When hudi integrates hive, an error is reported when the hive external table is queried [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on issue #10084:
URL: https://github.com/apache/hudi/issues/10084#issuecomment-1815611713

   Flink 1.13.1 should use Parquet 1.11 right? Have you checked the project 
parquet version for other modules so you do not package multiple parquet jars 
in one shot.





Re: [PR] [HUDI-7099] Providing metrics for archive and defining some string constants [hudi]

2023-11-16 Thread via GitHub


danny0405 commented on code in PR #10101:
URL: https://github.com/apache/hudi/pull/10101#discussion_r1396571766


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/HoodieTimelineArchiver.java:
##
@@ -117,6 +122,10 @@ public boolean archiveIfRequired(HoodieEngineContext 
context, boolean acquireLoc
   } else {
 LOG.info("No Instants to archive");
   }
+  if (success && timerContext != null) {
+long durationMs = metrics.getDurationInMs(timerContext.stop());

Review Comment:
   Yeah, need to think through what the return type should look like.






Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396557170


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java:
##
@@ -79,7 +78,7 @@ public BigQuerySyncTool(Properties props) {
 this.bqSchemaResolver = BigQuerySchemaResolver.getInstance();
   }
 
-  @VisibleForTesting // allows us to pass in mocks for the writer and client
+  // allows us to pass in mocks for the writer and client

Review Comment:
   yes, accidentally removed






Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815576193

   
   ## CI report:
   
   * 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20961)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r139668


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig 
implements Serializable
   .markAdvanced()
   .withDocumentation("Fetch file listing from Hudi's metadata");
 
+  public static final ConfigProperty 
BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
+  .key("hoodie.gcp.bigquery.sync.require_partition_filter")
+  .defaultValue(false)
+  .withDocumentation("If true, configure table to require a partition 
filter to be specified when querying the table");
+
+  public static final ConfigProperty 
BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty
+  .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id")
+  .noDefaultValue()
+  .withDocumentation("The Big Lake connection ID to use");
+

Review Comment:
   should it be 0.14.1?






Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


the-other-tim-brown commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396555489


##
hudi-gcp/pom.xml:
##
@@ -70,7 +70,6 @@ See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
 
   com.google.cloud
   google-cloud-pubsub
-  ${google.cloud.pubsub.version}

Review Comment:
   We include the BOM above, so this was not required.






Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


yihua commented on code in PR #10125:
URL: https://github.com/apache/hudi/pull/10125#discussion_r1396550878


##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -83,7 +83,6 @@ public class BigQuerySyncConfig extends HoodieSyncConfig 
implements Serializable
   .key("hoodie.gcp.bigquery.sync.use_bq_manifest_file")
   .defaultValue(false)
   .markAdvanced()
-  .sinceVersion("0.14.0")

Review Comment:
   nit: this is still needed.



##
hudi-gcp/pom.xml:
##
@@ -70,7 +70,6 @@ See 
https://github.com/GoogleCloudPlatform/cloud-opensource-java/wiki/The-Google
 
   com.google.cloud
   google-cloud-pubsub
-  ${google.cloud.pubsub.version}

Review Comment:
   Is this safe to remove now?



##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncTool.java:
##
@@ -79,7 +78,7 @@ public BigQuerySyncTool(Properties props) {
 this.bqSchemaResolver = BigQuerySchemaResolver.getInstance();
   }
 
-  @VisibleForTesting // allows us to pass in mocks for the writer and client
+  // allows us to pass in mocks for the writer and client

Review Comment:
   Is the annotation still needed?



##
hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java:
##
@@ -122,6 +121,16 @@ public class BigQuerySyncConfig extends HoodieSyncConfig 
implements Serializable
   .markAdvanced()
   .withDocumentation("Fetch file listing from Hudi's metadata");
 
+  public static final ConfigProperty 
BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
+  .key("hoodie.gcp.bigquery.sync.require_partition_filter")
+  .defaultValue(false)
+  .withDocumentation("If true, configure table to require a partition 
filter to be specified when querying the table");
+
+  public static final ConfigProperty 
BIGQUERY_SYNC_BIG_LAKE_CONNECTION_ID = ConfigProperty
+  .key("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id")
+  .noDefaultValue()
+  .withDocumentation("The Big Lake connection ID to use");
+

Review Comment:
   Add `.sinceVersion("0.14.0")` and mark them advanced (`.markAdvanced()`, if 
they are not required to be set by the users)?
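   
   A minimal sketch of what the reviewed property could look like with both suggestions applied; the since-version value is an assumption pending the 0.14.1 question raised elsewhere in this review, and the wrapper class exists only to keep the snippet self-contained:
   
   ```java
   import org.apache.hudi.common.config.ConfigProperty;
   
   public class BigQuerySyncConfigSketch {
     // Sketch only (not the actual patch): the reviewed property with the suggested
     // .markAdvanced() and .sinceVersion() calls applied.
     public static final ConfigProperty<Boolean> BIGQUERY_SYNC_REQUIRE_PARTITION_FILTER = ConfigProperty
         .key("hoodie.gcp.bigquery.sync.require_partition_filter")
         .defaultValue(false)
         .markAdvanced()          // optional for users, so marked advanced
         .sinceVersion("0.14.1")  // assumed target release; adjust to the version the PR lands in
         .withDocumentation("If true, configure table to require a partition filter to be specified when querying the table");
   }
   ```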






Re: [PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


hudi-bot commented on PR #10125:
URL: https://github.com/apache/hudi/pull/10125#issuecomment-1815570231

   
   ## CI report:
   
   * 7c1b9cc77e2e5ea2ee9d6089f41b5a9c482de9f5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7115) Add more options for BigQuery Sync

2023-11-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7115:
-
Labels: pull-request-available  (was: )

> Add more options for BigQuery Sync
> --
>
> Key: HUDI-7115
> URL: https://issues.apache.org/jira/browse/HUDI-7115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> There are options for requiring a partition filter and adding a Big Lake
> connection ID, which enable new access control features that users may
> want to use in their environment.





[PR] [HUDI-7115] Add in new options for the bigquery sync [hudi]

2023-11-16 Thread via GitHub


the-other-tim-brown opened a new pull request, #10125:
URL: https://github.com/apache/hudi/pull/10125

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7115) Add more options for BigQuery Sync

2023-11-16 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-7115:
---

 Summary: Add more options for BigQuery Sync
 Key: HUDI-7115
 URL: https://issues.apache.org/jira/browse/HUDI-7115
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Timothy Brown


There are options for requiring a partition filter and adding a Big Lake connection ID, which enable new access control features that users may want to use in their environment.
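
A rough sketch of how the proposed options might be supplied once available; the key names come from the linked PR, while the connection ID value and the class wrapper are placeholders for illustration:

```java
import java.util.Properties;

public class BigQuerySyncOptionsSketch {
  public static void main(String[] args) {
    // Key names are taken from the proposed BigQuerySyncConfig additions;
    // the connection ID below is a placeholder value.
    Properties props = new Properties();
    props.setProperty("hoodie.gcp.bigquery.sync.require_partition_filter", "true");
    props.setProperty("hoodie.onehouse.gcp.bigquery.sync.big_lake_connection_id",
        "projects/my-project/locations/us/connections/my-connection");
    // These would be passed to BigQuerySyncTool together with the existing sync settings.
    System.out.println(props);
  }
}
```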





Re: [I] [SUPPORT] RFC 63 Functional Index Hudi 0.1.0-beta [hudi]

2023-11-16 Thread via GitHub


soumilshah1995 commented on issue #10110:
URL: https://github.com/apache/hudi/issues/10110#issuecomment-1815424659

   # Code
   
   ```
   
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType, FloatType
    from datetime import datetime
    import os
    import sys
    
    HUDI_VERSION = '1.0.0-beta1'
    SPARK_VERSION = '3.4'
    
    SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
    os.environ['PYSPARK_PYTHON'] = sys.executable
    
    # Spark session
    spark = SparkSession.builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
        .config('className', 'org.apache.hudi') \
        .config('spark.sql.hive.convertMetastoreParquet', 'false') \
        .getOrCreate()
    
    data = [
        [1695159649, '334e26e9-8355-45cc-97c6-c31daf0df330', 'rider-A', 'driver-K', 19.10, 'san_francisco'],
        [1695159649, 'e96c4396-3fad-413a-a942-4cb36106d721', 'rider-C', 'driver-M', 27.70, 'san_francisco'],
    ]
    
    # Define schema for the DataFrame
    schema = StructType([
        StructField("ts", StringType(), True),
        StructField("transaction_id", StringType(), True),
        StructField("rider", StringType(), True),
        StructField("driver", StringType(), True),
        StructField("price", FloatType(), True),
        StructField("location", StringType(), True),
    ])
    
    # Create Spark DataFrame
    df = spark.createDataFrame(data, schema=schema)
    
    df.show()
    
    path = 'file:///Users/soumilnitinshah/Downloads/hudidb/hudi_table_func_index'
    
    hudi_options = {
        'hoodie.table.name': 'hudi_table_func_index',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.recordkey.field': 'transaction_id',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.table.metadata.enable': 'true',
        'hoodie.datasource.write.partitionpath.field': 'location',
        'hoodie.parquet.small.file.limit': '0'
    }
    
    df.write.format("hudi").options(**hudi_options).mode("append").save(path)
    
    PATH = 'file:///Users/soumilnitinshah/Downloads/hudidb/hudi_table_func_index'
    TABLE_NAME = "hudi_table_func_index"
    
    spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)
    
    spark.sql(f"""SELECT from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM {TABLE_NAME}""").show()
    spark.sql(f"""CREATE INDEX {TABLE_NAME}_datestr ON {TABLE_NAME} USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')""")
   
   
   ```
   
   # Error 
   ```
   
   +--+
   |   datestr|
   +--+
   |2023-09-19|
   |2023-09-19|
   +--+
   
   ---
   Py4JJavaError Traceback (most recent call last)
   Cell In[7], line 7
 4 
spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)
      6 spark.sql(f"""SELECT from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM {TABLE_NAME}""").show()
    ----> 7 spark.sql(f"""CREATE INDEX {TABLE_NAME}_datestr ON {TABLE_NAME} USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')""")
   
   File ~/anaconda3/lib/python3.11/site-packages/pyspark/sql/session.py:1440, 
in SparkSession.sql(self, sqlQuery, args, **kwargs)
  1438 try:
  1439 litArgs = {k: _to_java_column(lit(v)) for k, v in (args or 
{}).items()}
   -> 1440 return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), 
self)
  1441 finally:
  1442 if len(kwargs) > 0:
   
   File ~/anaconda3/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in 
JavaMember.__call__(self, *args)
  1316 command = proto.CALL_COMMAND_NAME +\
  1317 self.command_header +\
  1318 args_command +\
  1319 proto.END_COMMAND_PART
  1321 answer = self.gateway_client.send_command(command)
   -> 1322 return_value = get_return_value(
  1323 answer, self.gateway_client, self.target_id, self.name)
  1325 for temp_arg in temp_args:
  1326 if hasattr(temp_arg, "_detach"):
   
   File 
~/anaconda3/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py:169,
 in capture_sql_exception..deco(*a, **kw)
   167 def deco(*a: Any, **kw: Any) -> Any:
   168 try:
   --> 169 return f(*a, **kw)
   170 except Py4JJavaError as e:
   171 converted = convert_exception(e.java_exception)
   
   File 

Re: [PR] [MINOR] Fix default config values if not specified in MultipleSparkJobExecutionStrategy [hudi]

2023-11-16 Thread via GitHub


nsivabalan commented on code in PR #9625:
URL: https://github.com/apache/hudi/pull/9625#discussion_r1396445022


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##
@@ -116,7 +116,7 @@ public HoodieWriteMetadata> 
performClustering(final Hood
   Stream> writeStatusesStream = FutureUtils.allOf(
   clusteringPlan.getInputGroups().stream()
   .map(inputGroup -> {
-if 
(getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable",
 false)) {

Review Comment:
   @yihua : this was intentionally kept as false (the default value). 
   The row writer is enabled with clustering only if the user explicitly enabled it.
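   
   For illustration, a minimal sketch of that explicit opt-in (only the property key comes from the code above; everything else is illustrative):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   public class RowWriterOptInSketch {
     public static void main(String[] args) {
       // Illustrative only: the clustering strategy defaults this flag to false,
       // so the row-writer path is taken only when the user sets it explicitly.
       Map<String, String> writeOptions = new HashMap<>();
       writeOptions.put("hoodie.datasource.write.row.writer.enable", "true");
       System.out.println(writeOptions);
     }
   }
   ```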
   






Re: [I] Unable to alter column name for a Hudi table in AWS [hudi]

2023-11-16 Thread via GitHub


soumilshah1995 commented on issue #9780:
URL: https://github.com/apache/hudi/issues/9780#issuecomment-1815417347

   Found the code 
   
   here is code 
https://github.com/soumilshah1995/code-snippets/blob/main/schema_evol_lab.ipynb
   
   Here is Video Guide 
   https://www.youtube.com/watch?v=-_TFB0Gh3TY
   
   





Re: [I] Unable to alter column name for a Hudi table in AWS [hudi]

2023-11-16 Thread via GitHub


soumilshah1995 commented on issue #9780:
URL: https://github.com/apache/hudi/issues/9780#issuecomment-1815415276

   Let me search for my code snippet; I know I have done this on AWS.





Re: [PR] [HUDI-6958] Simplify Out Of Box Schema Evolution Functionality - DOCS [hudi]

2023-11-16 Thread via GitHub


lokesh-lingarajan-0310 commented on code in PR #9881:
URL: https://github.com/apache/hudi/pull/9881#discussion_r1396339333


##
website/docs/schema_evolution.md:
##
@@ -22,21 +22,36 @@ the previous schema (e.g., renaming a column).
 Furthermore, the evolved schema is queryable across high-performance engines like Presto and Spark SQL without additional overhead for column ID translations or
 type reconciliations. The following table summarizes the schema changes compatible with different Hudi table types.
 
-| Schema Change | COW | MOR | Remarks |
-|:---|:---|:---|:---|
-| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with evolved schema succeeds and a read following the write succeeds to read entire dataset. |
-| Add a new nullable column to inner struct (at the end) | Yes | Yes | |
-| Add a new complex type field with default (map and array) | Yes | Yes | |
-| Add a new nullable column and change the ordering of fields | No | No | Write succeeds but read fails if the write with evolved schema updated only some of the base files but not all. Currently, Hudi does not maintain a schema registry with history of changes across base files. Nevertheless, if the upsert touched all base files then the read will succeed. |
-| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
-| Promote datatype from `int` to `long` for a field at root level | Yes | Yes | For other types, Hudi supports promotion as specified in [Avro schema resolution](http://avro.apache.org/docs/current/spec#Schema+Resolution). |
-| Promote datatype from `int` to `long` for a nested field | Yes | Yes | |
-| Promote datatype from `int` to `long` for a complex type (value of map or array) | Yes | Yes | |
-| Add a new non-nullable column at root level at the end | No | No | In case of MOR table with Spark data source, write succeeds but read fails. As a **workaround**, you can make the field nullable. |
-| Add a new non-nullable column to inner struct (at the end) | No | No | |
-| Change datatype from `long` to `int` for a nested field | No | No | |
