[jira] [Updated] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites

2024-01-30 Thread eric (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eric updated HUDI-7361:
---
Description: 
{quote}CREATE TABLE tbl (
..
) WITH (
'connector' = 'hudi',
'path' = '/tblpath',
'table.type' = 'COPY_ON_WRITE',
'write.bucket_assign.tasks'='5',
'write.operation'='insert',
'write.tasks'='5', 
'clustering.schedule.enabled'='true',
'clustering.async.enabled'='true',
'clustering.delta_commits'='3',
'clustering.tasks'='5',
'hoodie.cleaner.policy.failed.writes'='LAZY'
);
{quote}
*Table parameters are as above*

 

*From the jobmanager and taskmanager logs, the sequence that triggers the 
failure can be summarized as follows:* 


Before the write client completes commit 20240126154725671, the clean table 
service starts. As part of cleaning, it must check for failed writes and roll 
them back. 

This check inspects the heartbeat of every inflight instant and rolls back any 
instant whose heartbeat has timed out. Meanwhile, the write client completes 
commit 20240126154725671 and deletes the heartbeat file of this instant. 

The clean table service client then reads a last heartbeat of 0 for the 
instant, treats it as expired, and rolls the instant back, even though the 
commit had already completed.
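
The interleaving described above can be replayed deterministically with a small Python sketch. Everything here is illustrative: `HeartbeatStore`, `instants_to_roll_back`, the timeout value, and the `recheck` guard are hypothetical names and assumptions, not Hudi's actual classes or the fix adopted for this issue. The only behavior taken from the report is that a deleted heartbeat file reads back as 0.

```python
HEARTBEAT_TIMEOUT_MS = 60_000  # hypothetical timeout, not a Hudi default


class HeartbeatStore:
    """In-memory stand-in for the per-instant heartbeat files."""

    def __init__(self):
        self._last_beat_ms = {}

    def beat(self, instant, now_ms):
        self._last_beat_ms[instant] = now_ms

    def delete(self, instant):
        self._last_beat_ms.pop(instant, None)

    def last_beat_ms(self, instant):
        # A missing heartbeat file is reported as 0, as in the report above.
        return self._last_beat_ms.get(instant, 0)


def is_expired(store, instant, now_ms):
    return now_ms - store.last_beat_ms(instant) > HEARTBEAT_TIMEOUT_MS


def instants_to_roll_back(store, inflight_snapshot, completed_now, now_ms, recheck):
    """Return the instants the cleaner would roll back.

    inflight_snapshot: instants that looked inflight when the cleaner started.
    completed_now: callable returning the set of instants completed by now.
    recheck: if True, re-read the timeline before rolling back (one possible guard).
    """
    victims = []
    for instant in inflight_snapshot:
        if not is_expired(store, instant, now_ms):
            continue
        if recheck and instant in completed_now():
            continue  # the commit finished meanwhile, so a missing heartbeat is expected
        victims.append(instant)
    return victims


# Replay: the cleaner snapshots the inflight instants, then the writer
# completes the commit and deletes the heartbeat before the cleaner checks it.
store = HeartbeatStore()
instant = "20240126154725671"
now_ms = 1_706_255_245_671  # arbitrary wall-clock value for the demo

store.beat(instant, now_ms)
snapshot = [instant]
completed = {instant}
store.delete(instant)

unsafe = instants_to_roll_back(store, snapshot, lambda: completed, now_ms, recheck=False)
guarded = instants_to_roll_back(store, snapshot, lambda: completed, now_ms, recheck=True)
```

Without the re-check, the completed instant is selected for rollback because its heartbeat reads as 0; with the re-check it is skipped. Whether the real fix re-reads the timeline, takes a lock, or does something else is not stated in this thread.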

> Fix a concurrency issue caused by rollbackFailedWrites
> --
>
> Key: HUDI-7361
> URL: https://issues.apache.org/jira/browse/HUDI-7361
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: eric
>Priority: Major
> Attachments: jobmanager_log.txt, taskmanager_log.txt
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites

2024-01-30 Thread eric (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eric updated HUDI-7361:
---
Attachment: jobmanager_log.txt
taskmanager_log.txt

> Fix a concurrency issue caused by rollbackFailedWrites
> --
>
> Key: HUDI-7361
> URL: https://issues.apache.org/jira/browse/HUDI-7361
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: eric
>Priority: Major
> Attachments: jobmanager_log.txt, taskmanager_log.txt
>
>






[jira] [Created] (HUDI-7361) Fix a concurrency issue caused by rollbackFailedWrites

2024-01-30 Thread eric (Jira)
eric created HUDI-7361:
--

 Summary: Fix a concurrency issue caused by rollbackFailedWrites
 Key: HUDI-7361
 URL: https://issues.apache.org/jira/browse/HUDI-7361
 Project: Apache Hudi
  Issue Type: Bug
  Components: writer-core
Reporter: eric








[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception

2024-01-30 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka updated HUDI-7360:

Description: 
Github Issue - [https://github.com/apache/hudi/issues/10590]

Reproducible code 

```
from typing import Any

from pyspark import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Hudi Basics") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

sc = spark.sparkContext

table_name = "hudi_trips_cdc"
base_path = "/tmp/test_issue_10590_4"  # Replace with whatever path you like
quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
dataGen = quickstart_utils.DataGenerator()

inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))


def create_df():
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    return df


def write_data():
    df = create_df()
    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        # This can be either MoR or CoW and the error will still happen
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # This can be left enabled, and won't affect anything unless actually queried as CDC
        "hoodie.table.cdc.enabled": "true",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2
    }
    df.write.format("hudi") \
        .options(**hudi_options) \
        .mode("overwrite") \
        .save(base_path)


def update_data():
    updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.write \
        .format("hudi") \
        .mode("append") \
        .save(base_path)


def incremental_query():
    ordered_rows: list[Row] = spark.read \
        .format("hudi") \
        .load(base_path) \
        .select(col("_hoodie_commit_time").alias("commit_time")) \
        .orderBy(col("commit_time")) \
        .collect()
    commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
    begin_time = commits[0]
    incremental_read_options = {
        # Query as CDC; this crashes in 0.14.1
        'hoodie.datasource.query.incremental.format': "cdc",
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': begin_time,
    }
    trips_incremental_df = spark.read \
        .format("hudi") \
        .options(**incremental_read_options) \
        .load(base_path)
    # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
    # sql_query = f""" SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
    # trips_incremental_df = spark.sql(sql_query)
    trips_incremental_df.show()
    trips_incremental_df.printSchema()


if __name__ == "__main__":
    write_data()
    update_data()
    incremental_query()
```
 

 

 

 

  was:Github Issue - [https://github.com/apache/hudi/issues/10590]


> Incremental CDC Query after 0.14.1 upgrade giving Jackson class 
> incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]

[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception

2024-01-30 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka updated HUDI-7360:

Summary: Incremental CDC Query after 0.14.1 upgrade giving Jackson class 
incompatibility exception  (was: Incremental CDC Query after 0.14.X upgrade 
giving Jackson class incompatibility exception)

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class 
> incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]





[jira] [Created] (HUDI-7360) Incremental CDC Query after 0.14.X upgrade giving Jackson class incompatibility exception

2024-01-30 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-7360:
---

 Summary: Incremental CDC Query after 0.14.X upgrade giving Jackson 
class incompatibility exception
 Key: HUDI-7360
 URL: https://issues.apache.org/jira/browse/HUDI-7360
 Project: Apache Hudi
  Issue Type: Bug
  Components: reader-core
Reporter: Aditya Goenka
 Fix For: 1.1.0


Github Issue - [https://github.com/apache/hudi/issues/10590]





[jira] [Commented] (HUDI-7320) hive-sync unexpectedly loads archived timeline

2024-01-30 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812499#comment-17812499
 ] 

sivabalan narayanan commented on HUDI-7320:
---

We already fixed something along these lines. Can you check whether it's 
reproducible with 0.14.0 as well? 

 

> hive-sync unexpectedly loads archived timeline
> --
>
> Key: HUDI-7320
> URL: https://issues.apache.org/jira/browse/HUDI-7320
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Affects Versions: 0.13.1
>Reporter: Raymond Xu
>Priority: Critical
> Attachments: Screenshot 2024-01-16 at 5.49.25 PM.png, Screenshot 
> 2024-01-16 at 5.49.30 PM.png
>
>
> Investigation shows that the hive-sync step loaded the archived timeline, 
> which caused a long delay in the overall write process. Also, the full scan 
> for changes in all partitions is not used. Needs further digging.





Re: [PR] added new videos for hudi oss site [hudi]

2024-01-30 Thread via GitHub


nfarah86 commented on PR #10563:
URL: https://github.com/apache/hudi/pull/10563#issuecomment-1917696609

   @bhasudha made the aws -> amazon changes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917684020

   
   ## CI report:
   
   * e6a99b7319648fce943abc73b460239350ff18d3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10575:
URL: https://github.com/apache/hudi/pull/10575#issuecomment-1917510979

   
   ## CI report:
   
   * 24d06d5c92ebb9ef98c4689365eabd1e197c7197 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   * 806bd78e4b5f1bc0de9950daeee59dceccba9941 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8)
 
   
   



Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917495130

   
   ## CI report:
   
   * e6a99b7319648fce943abc73b460239350ff18d3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2)
 
   
   



Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10575:
URL: https://github.com/apache/hudi/pull/10575#issuecomment-1917494985

   
   ## CI report:
   
   * 24d06d5c92ebb9ef98c4689365eabd1e197c7197 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   * 806bd78e4b5f1bc0de9950daeee59dceccba9941 UNKNOWN
   
   



Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10586:
URL: https://github.com/apache/hudi/pull/10586#issuecomment-1917480260

   
   ## CI report:
   
   * e6a99b7319648fce943abc73b460239350ff18d3 UNKNOWN
   
   



Re: [PR] [HUDI-7347][Stacked on HUDI-7335] Introduce SeekableDataInputStream for random access [hudi]

2024-01-30 Thread via GitHub


yihua commented on code in PR #10575:
URL: https://github.com/apache/hudi/pull/10575#discussion_r1471603752


##
hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HadoopSeekableDataInputStream.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.fs;
+
+import org.apache.hudi.io.SeekableDataInputStream;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+
+import java.io.IOException;
+
+/**
+ * An implementation of {@link SeekableDataInputStream} based on Hadoop's {@link FSDataInputStream}
+ */
+public class HadoopSeekableDataInputStream extends SeekableDataInputStream {
+  private final FSDataInputStream stream;
+
+  public HadoopSeekableDataInputStream(FSDataInputStream stream) {
+    super(stream);
+    this.stream = stream;
+  }
+
+  @Override
+  public long getPosition() throws IOException {

Review Comment:
   Fixed now.






(hudi) branch master updated: [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR (#10570)

2024-01-30 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new a078242b19d [HUDI-7343] Replace Path.SEPARATOR with 
HoodieLocation.SEPARATOR (#10570)
a078242b19d is described below

commit a078242b19dc3f8b46d08e197d8b77fa34f1808a
Author: Y Ethan Guo 
AuthorDate: Tue Jan 30 08:47:30 2024 -0800

[HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR (#10570)
---
 .../apache/hudi/cli/commands/ExportCommand.java|  5 +--
 .../cli/commands/TestHoodieLogFileCommand.java |  3 +-
 .../apache/hudi/cli/commands/TestTableCommand.java |  5 +--
 .../hudi/cli/integ/ITTestBootstrapCommand.java |  9 +++---
 .../cli/integ/ITTestHDFSParquetImportCommand.java  |  5 +--
 .../hudi/cli/integ/ITTestMarkersCommand.java   |  5 +--
 .../hudi/cli/integ/ITTestSavepointsCommand.java|  3 +-
 .../apache/hudi/cli/integ/ITTestTableCommand.java  | 12 
 .../hudi/client/heartbeat/HeartbeatUtils.java  |  3 +-
 .../client/heartbeat/HoodieHeartbeatClient.java|  6 ++--
 .../lock/FileSystemBasedLockProvider.java  |  7 +++--
 .../BaseHoodieFunctionalIndexClient.java   |  5 ++-
 .../timeline/TestCompletionTimeQueryView.java  |  6 ++--
 .../utils/TestLegacyArchivedMetaEntryReader.java   |  5 +--
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |  9 +++---
 .../client/utils/SparkMetadataWriterUtils.java |  3 +-
 .../hudi/client/TestHoodieClientMultiWriter.java   |  3 +-
 .../functional/TestHoodieBackedMetadata.java   | 19 ++--
 .../DirectMarkerBasedDetectionStrategy.java|  3 +-
 .../hudi/common/fs/inline/InLineFSUtils.java   | 12 +---
 .../common/heartbeat/HoodieHeartbeatUtils.java |  4 ++-
 .../hudi/common/table/HoodieTableMetaClient.java   | 36 --
 .../hudi/metadata/AbstractHoodieTableMetadata.java |  9 +++---
 .../hudi/metadata/HoodieMetadataPayload.java   |  3 +-
 .../apache/hudi/metadata/HoodieTableMetadata.java  | 11 ---
 .../hudi/metadata/HoodieTableMetadataUtil.java |  3 +-
 .../common/fs/TestHoodieWrapperFileSystem.java |  3 +-
 .../org/apache/hudi/sink/meta/CkpMetadata.java |  4 ++-
 .../java/org/apache/hudi/source/FileIndex.java |  3 +-
 .../hudi/table/catalog/TableOptionProperties.java  |  3 +-
 .../apache/hudi/table/format/FilePathUtils.java|  5 +--
 .../main/java/org/apache/hudi/util/ClientIds.java  |  3 +-
 .../apache/hudi/util/ViewStorageProperties.java|  3 +-
 .../apache/hudi/sink/ITTestDataStreamWrite.java|  3 +-
 .../hudi/sink/bucket/ITTestBucketStreamWrite.java  |  3 +-
 .../org/apache/hudi/sink/utils/TestWriteBase.java  |  4 ++-
 .../test/java/org/apache/hudi/utils/TestUtils.java |  3 +-
 .../hudi/hadoop/utils/HoodieInputFormatUtils.java  |  3 +-
 .../apache/hudi/hadoop/TestInputPathHandler.java   | 13 
 .../procedures/ExportInstantsProcedure.scala   | 16 +-
 .../apache/hudi/testutils/DataSourceTestUtils.java |  3 +-
 .../org/apache/hudi/TestHoodieFileIndex.scala  | 19 +++-
 .../hudi/procedure/TestBootstrapProcedure.scala| 25 +++
 .../procedure/TestHdfsParquetImportProcedure.scala |  5 +--
 .../hudi/analysis/HoodieSpark32PlusAnalysis.scala  | 18 +--
 .../hudi/hive/testutils/HiveTestService.java   |  4 +--
 .../MarkerBasedEarlyConflictDetectionRunnable.java |  3 +-
 .../utilities/streamer/SparkSampleWritesUtils.java |  3 +-
 48 files changed, 197 insertions(+), 146 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
index 40e7154b5f9..b196c62d0fb 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
@@ -44,6 +44,7 @@ import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.util.collection.ClosableIterator;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.hadoop.fs.HadoopFSUtils;
+import org.apache.hudi.storage.HoodieLocation;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -168,7 +169,7 @@ public class ExportCommand {
   LOG.error("Could not load metadata for action " + action + " at instant time " + instantTime);
   continue;
 }
-final String outPath = localFolder + Path.SEPARATOR + instantTime + "." + action;
+final String outPath = localFolder + HoodieLocation.SEPARATOR + instantTime + "." + action;
 writeToFile(outPath, HoodieAvroUtils.avroToJson(metadata, true));
   }
 }
@@ -190,7 +191,7 @@ public class ExportCommand {
 final HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
 final HoodieActiveTimeline t

Re: [PR] [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR [hudi]

2024-01-30 Thread via GitHub


yihua merged PR #10570:
URL: https://github.com/apache/hudi/pull/10570





Re: [PR] [HUDI-7343] Replace Path.SEPARATOR with HoodieLocation.SEPARATOR [hudi]

2024-01-30 Thread via GitHub


yihua commented on PR #10570:
URL: https://github.com/apache/hudi/pull/10570#issuecomment-1917304625

   > I didn't check all the usages of the `Path.SEPARATOR`, the change looks 
straight-forward so I approved it.
   
   Yes, `Path.SEPARATOR` usages are all replaced.





Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-01-30 Thread via GitHub


jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1471439782


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/ObjectInspectorCache.java:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import org.apache.avro.Schema;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde.serdeConstants;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapred.JobConf;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+/**
+ * To read value from an ArrayWritable, an ObjectInspector is needed.
+ * Object inspectors are cached here or created using the column type map.
+ */
+public class ObjectInspectorCache {
+  private final Map<String, TypeInfo> columnTypeMap = new HashMap<>();
+  private final Cache
+  objectInspectorCache = Caffeine.newBuilder().maximumSize(1000).build();
+
+  public Map<String, TypeInfo> getColumnTypeMap() {
+    return columnTypeMap;
+  }
+
+  public ObjectInspectorCache(Schema tableSchema, JobConf jobConf) {
+    //From AbstractRealtimeRecordReader#prepareHiveAvroSerializer
+    // hive will append virtual columns at the end of column list. we should remove those columns.
+    // eg: current table is col1, col2, col3; jobConf.get(serdeConstants.LIST_COLUMNS): col1, col2, col3 ,BLOCK__OFFSET__INSIDE__FILE ...
+    Set<String> writerSchemaColNames = tableSchema.getFields().stream().map(f -> f.name().toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
+    List<String> columnNameList = Arrays.stream(jobConf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+    List<TypeInfo> columnTypeList = TypeInfoUtils.getTypeInfosFromTypeString(jobConf.get(serdeConstants.LIST_COLUMN_TYPES));
+
+    int columnNameListLen = columnNameList.size() - 1;
+    for (int i = columnNameListLen; i >= 0; i--) {
+      String lastColName = columnNameList.get(columnNameList.size() - 1);
+      // virtual columns will only append at the end of column list. it will be ok to break the loop.
+      if (writerSchemaColNames.contains(lastColName)) {
+        break;
+      }
+      columnNameList.remove(columnNameList.size() - 1);
+      columnTypeList.remove(columnTypeList.size() - 1);
+    }
+
+    //Use columnNameList.size() instead of columnTypeList because the type list is longer for some reason
+    IntStream.range(0, columnNameList.size()).boxed().forEach(i -> columnTypeMap.put(columnNameList.get(i),
+        TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));
+
+    StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+    ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);

Review Comment:
   FYI this is pretty much a copy of 
https://github.com/apache/hudi/blob/e9389ffde53fa2b28feba248b7e8f17fd565e458/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L111
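
For readers skimming the thread, the trailing-virtual-column trimming that the quoted constructor performs can be sketched in a few lines of Python. This is only an illustration of the idea stated in the code comments (Hive appends virtual columns at the end of the column list, so trimming can stop at the first column that belongs to the table schema). The function name and the sample column types are made up for the example, and the sketch assumes both lists have the same length.

```python
def trim_trailing_virtual_columns(column_names, column_types, schema_field_names):
    """Drop trailing columns that are not part of the table schema.

    Only the tail of the list is inspected, and trimming stops at the first
    column that belongs to the schema. Assumes both lists have equal length.
    """
    known = {name.lower() for name in schema_field_names}
    names, types = list(column_names), list(column_types)
    while names and names[-1] not in known:
        names.pop()
        types.pop()
    return names, types


# Hypothetical input: three table columns followed by two Hive virtual columns.
names, types = trim_trailing_virtual_columns(
    ["col1", "col2", "col3", "BLOCK__OFFSET__INSIDE__FILE", "INPUT__FILE__NAME"],
    ["string", "int", "double", "bigint", "string"],
    ["col1", "col2", "col3"],
)
```

As in the quoted Java, the schema names are lower-cased while the Hive column names are compared as-is, so this relies on the column list already being lower case.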






Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-01-30 Thread via GitHub


jonvex commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1471420556


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java:
##
@@ -91,9 +94,42 @@ private void initAvroInputFormat() {
 }
   }
 
+  private static boolean checkIfHudiTable(final InputSplit split, final JobConf job) {
+    try {
+      Option<Path> tablePathOpt = TablePathUtils.getTablePath(((FileSplit) split).getPath(), job);
+      if (!tablePathOpt.isPresent()) {
+        return false;
+      }
+      return tablePathOpt.get().getFileSystem(job).exists(new Path(tablePathOpt.get(), HoodieTableMetaClient.METAFOLDER_NAME));
+    } catch (IOException e) {
+      return false;
+    }
+  }
+
   @Override
   public RecordReader getRecordReader(final InputSplit split, final JobConf job,
                                       final Reporter reporter) throws IOException {
+
+    if (HoodieFileGroupReaderRecordReader.useFilegroupReader(job)) {
+      try {
Review Comment:
   
https://github.com/apache/hudi/blob/2c38ef740d3d34e9eb05b59fa147c55623b81a90/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderRecordReader.java#L249
 I remove the partition fields from the read columns if the parquet file 
doesn't contain them. Does that help?
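
As a rough illustration of what the `checkIfHudiTable` helper in the diff above is doing, here is a local-filesystem Python sketch: resolve a table path for the split's file, then test whether the `.hoodie` metadata folder exists under it, treating I/O errors as "not a Hudi table". The walk-up-the-parents lookup is an assumption standing in for `TablePathUtils.getTablePath`, whose real logic is not shown in this thread, and `find_table_path`, `is_hudi_table_file`, and `search_root` are hypothetical names.

```python
import tempfile
from pathlib import Path

METAFOLDER_NAME = ".hoodie"  # value of HoodieTableMetaClient.METAFOLDER_NAME


def find_table_path(file_path, search_root):
    """Walk up from file_path, no higher than search_root, looking for .hoodie."""
    search_root = Path(search_root).resolve()
    for parent in Path(file_path).resolve().parents:
        if (parent / METAFOLDER_NAME).is_dir():
            return parent
        if parent == search_root:
            break
    return None


def is_hudi_table_file(file_path, search_root):
    # Like the Java helper, any I/O failure is treated as "not a Hudi table".
    try:
        return find_table_path(file_path, search_root) is not None
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    hudi_file = root / "hudi_tbl" / "2024" / "01" / "30" / "part-0.parquet"
    plain_file = root / "plain_tbl" / "part-0.parquet"
    (root / "hudi_tbl" / METAFOLDER_NAME).mkdir(parents=True)
    for f in (hudi_file, plain_file):
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    hudi_result = is_hudi_table_file(hudi_file, root)
    plain_result = is_hudi_table_file(plain_file, root)
```

Swallowing the exception keeps the caller's fallback path simple, at the cost of hiding genuine filesystem errors; the review thread does not say whether that trade-off was discussed.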






Re: [PR] [HUDI-7045] Create parquet readers inside the reader context and implement schema.on.read in the filegroup reader in spark [hudi]

2024-01-30 Thread via GitHub


jonvex commented on PR #10278:
URL: https://github.com/apache/hudi/pull/10278#issuecomment-1917113112

   Azure CI all passing @yihua 





Re: [PR] added new videos for hudi oss site [hudi]

2024-01-30 Thread via GitHub


bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471267966


##
website/videoBlog/2023-10-14-Accelerating-Data-Processing-Leveraging-Apache-Hudi-with-DynamoDB-for-Faster-Commit-Time-Retrieval.md:
##
@@ -8,7 +8,7 @@ image: 
/assets/images/video_blogs/2023-10-14-Accelerating-Data-Processing-Levera
 navigate: "https://www.youtube.com/watch?v=YF8zq_nuSHE"
 tags:
 - guide
-- dyanmodb
+- aws dyanmodb

Review Comment:
   Same as above.






Re: [PR] added new videos for hudi oss site [hudi]

2024-01-30 Thread via GitHub


bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471267410


##
website/videoBlog/2023-08-06-Easy_Step_by_Step_Guide_for_Beginner_Setup_AWS_Transfer_Family_SFTP_with_S3.md:
##
@@ -11,7 +11,7 @@ tags:
 - third-party data
 - sftp
 - aws transfer family
-- amazon s3
+- aws s3

Review Comment:
   Same as above.






Re: [PR] added new videos for hudi oss site [hudi]

2024-01-30 Thread via GitHub


bhasudha commented on code in PR #10563:
URL: https://github.com/apache/hudi/pull/10563#discussion_r1471266750


##
website/videoBlog/2023-08-03-Powering_EventDriven_Workloads_with_Hudi_Read_Stream_AWS_Glue_Streaming_JOBS.md:
##
@@ -14,6 +14,6 @@ tags:
 - streaming
 - near real-time analytics
 - event bus
-- amazon sqs
+- aws sqs

Review Comment:
   Let's leave it as `amazon` instead of `aws`. I added this tag based on 
how the SQS documentation names it. For example, 
https://aws.amazon.com/sqs/ calls it `Amazon SQS`, not `aws sqs`.






Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-01-30 Thread via GitHub


xiarixiaoyao commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1471121556


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/ObjectInspectorCache.java:
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop.utils;
+
+import com.github.benmanes.caffeine.cache.Cache;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import org.apache.avro.Schema;
+import org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector;
+import org.apache.hadoop.hive.serde.serdeConstants;
+import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.mapred.JobConf;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+/**
+ * To read value from an ArrayWritable, an ObjectInspector is needed.
+ * Object inspectors are cached here or created using the column type map.
+ */
+public class ObjectInspectorCache {
+  private final Map columnTypeMap = new HashMap<>();
+  private final Cache
+  objectInspectorCache = Caffeine.newBuilder().maximumSize(1000).build();
+
+  public Map getColumnTypeMap() {
+return columnTypeMap;
+  }
+
+  public ObjectInspectorCache(Schema tableSchema, JobConf jobConf) {
+//From AbstractRealtimeRecordReader#prepareHiveAvroSerializer
+// hive will append virtual columns at the end of column list. we should remove those columns.
+// eg: current table is col1, col2, col3; jobConf.get(serdeConstants.LIST_COLUMNS): col1, col2, col3 ,BLOCK__OFFSET__INSIDE__FILE ...
+Set writerSchemaColNames = tableSchema.getFields().stream().map(f -> f.name().toLowerCase(Locale.ROOT)).collect(Collectors.toSet());
+List columnNameList = Arrays.stream(jobConf.get(serdeConstants.LIST_COLUMNS).split(",")).collect(Collectors.toList());
+List columnTypeList =  TypeInfoUtils.getTypeInfosFromTypeString(jobConf.get(serdeConstants.LIST_COLUMN_TYPES));
+
+int columnNameListLen = columnNameList.size() - 1;
+for (int i = columnNameListLen; i >= 0; i--) {
+  String lastColName = columnNameList.get(columnNameList.size() - 1);
+  // virtual columns will only append at the end of column list. it will be ok to break the loop.
+  if (writerSchemaColNames.contains(lastColName)) {
+break;
+  }
+  columnNameList.remove(columnNameList.size() - 1);
+  columnTypeList.remove(columnTypeList.size() - 1);
+}
+
+//Use columnNameList.size() instead of columnTypeList because the type list is longer for some reason
+IntStream.range(0, columnNameList.size()).boxed().forEach(i -> columnTypeMap.put(columnNameList.get(i),
+  TypeInfoUtils.getTypeInfosFromTypeString(columnTypeList.get(i).getQualifiedName()).get(0)));
+
+StructTypeInfo rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNameList, columnTypeList);
+ArrayWritableObjectInspector objectInspector = new ArrayWritableObjectInspector(rowTypeInfo);

Review Comment:
   > There may be compatibility issues between hive2 and hive3. DATE, TIMESTAMP
   
   I think hive will handle this itself. 
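
   As a side note for readers of the quoted diff, the trailing virtual-column trimming in the constructor can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: it works on plain strings rather than Hive `TypeInfo` objects, and the class `VirtualColumnTrim` and method `trimTrailingVirtualColumns` are hypothetical names, not part of the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical standalone sketch; not code from the PR.
final class VirtualColumnTrim {

  // Removes columns from the end of the comma-separated column list until the
  // last remaining column is one of the table's own fields. This relies on the
  // assumption stated in the diff that virtual columns are only appended at the end.
  static List<String> trimTrailingVirtualColumns(String listColumns, List<String> tableFields) {
    Set<String> known = tableFields.stream()
        .map(f -> f.toLowerCase(Locale.ROOT))
        .collect(Collectors.toSet());
    // Copy into an ArrayList so that removal from the tail is supported.
    List<String> names = new ArrayList<>(Arrays.asList(listColumns.split(",")));
    while (!names.isEmpty() && !known.contains(names.get(names.size() - 1))) {
      names.remove(names.size() - 1);
    }
    return names;
  }

  public static void main(String[] args) {
    String listColumns = "col1,col2,col3,BLOCK__OFFSET__INSIDE__FILE,INPUT__FILE__NAME";
    System.out.println(trimTrailingVirtualColumns(listColumns, Arrays.asList("col1", "col2", "col3")));
  }
}
```

   In the diff the type list is trimmed in lockstep with the name list; the sketch leaves that out to keep the example free of Hive dependencies.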






[I] [hudi bucket prune] [hudi]

2024-01-30 Thread via GitHub


lookingUpAtTheSky opened a new issue, #10589:
URL: https://github.com/apache/hudi/issues/10589

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? 
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   Hello, I have two questions.
   
   First, is there a clear plan for Spark to support bucket pruning?
   
   Second, when we calculate the bucketId of a field value, which method is the right one to 
format the value: HoodieAvroUtils.convertValueForSpecificDataTypes, 
ExpressionUtils.getKeyFromLiteral, or something else?
   
   Thanks.
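
   To illustrate why the formatting step matters for the second question, here is a minimal sketch. It assumes a hash-modulo bucketing scheme over the string form of the key; the class `BucketIdSketch` and method `bucketId` are hypothetical and are not Hudi's API, and the sketch does not say which of the two helpers is the right one.

```java
// Hypothetical illustration of hash-modulo bucketing; not Hudi's implementation.
final class BucketIdSketch {

  // Maps the formatted key to a bucket in [0, numBuckets).
  // Masking with Integer.MAX_VALUE clears the sign bit so the result is non-negative.
  static int bucketId(String formattedKey, int numBuckets) {
    return (formattedKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    // The same logical value, formatted two different ways.
    System.out.println(bucketId("1", 4));
    System.out.println(bucketId("1.0", 4));
  }
}
```

   Under this assumption, the same logical value formatted as `"1"` or `"1.0"` lands in different buckets, so a pruning filter has to format the literal exactly as the writer formatted the record key field.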





Re: [I] [SUPPORT] Querying Hudi tables with Spark+Velox(C++), ObjectSizeCalculator.getObjectSize hangs causing about a 50-second delay in queries [hudi]

2024-01-30 Thread via GitHub


majian1998 commented on issue #10580:
URL: https://github.com/apache/hudi/issues/10580#issuecomment-1916596628

   @ad1happy2go As I understand it, the issue started when the PR for [HUDI-4687] 
introduced the use of JOL to estimate object size.





Re: [I] [SUPPORT] Querying Hudi tables with Spark+Velox(C++), ObjectSizeCalculator.getObjectSize hangs causing about a 50-second delay in queries [hudi]

2024-01-30 Thread via GitHub


ad1happy2go commented on issue #10580:
URL: https://github.com/apache/hudi/issues/10580#issuecomment-1916559418

   @majian1998 Is this issue occurring only after the 0.14.0 upgrade, or was it happening 
with older Hudi versions too?
   





Re: [I] [SUPPORT] Can't delete key (row) for all commits in HUDI Table (history)? [hudi]

2024-01-30 Thread via GitHub


ad1happy2go commented on issue #10581:
URL: https://github.com/apache/hudi/issues/10581#issuecomment-1916557526

   @jens4doc I don't think there is a way to achieve that.





Re: [PR] Hudi 6868 - Support extracting passwords from credential store for Hive Sync [hudi]

2024-01-30 Thread via GitHub


hudi-bot commented on PR #10577:
URL: https://github.com/apache/hudi/pull/10577#issuecomment-1916344140

   
   ## CI report:
   
   * 40cbc324442334d3e1313f995c8ae9feed7d0db7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]

2024-01-30 Thread via GitHub


gsudhanshu commented on issue #10503:
URL: https://github.com/apache/hudi/issues/10503#issuecomment-1916285606

   @ad1happy2go thanks for your inputs. 
   
   I changed the path and removed the unnecessary keys, but I am still facing the 
same FileNotFoundException.
   
   It seems I will have to downgrade to 0.13.1 and standalone mode.

