[GitHub] [hudi] PhantomHunt commented on issue #9344: [SUPPORT] Getting error when writing to different HUDI tables in different threads in same job

2023-08-02 Thread via GitHub


PhantomHunt commented on issue #9344:
URL: https://github.com/apache/hudi/issues/9344#issuecomment-1663332820

   We have a job running on an EC2 Ubuntu machine that upserts data into 2 Hudi tables in parallel, in 2 threads at a time (using `ThreadPoolExecutor` from Python's `concurrent.futures` library). There are 17 tables in total. When the upsert into any one table finishes, the `ThreadPoolExecutor` picks up another table to process on the freed thread. The job terminates once the upserts into all 17 tables have finished, and it runs every 5 minutes via a cron job.
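   
   For illustration, a minimal sketch of this fan-out pattern, written here in Java with an `ExecutorService` (the job above uses Python's `ThreadPoolExecutor`; the table names and the `upsertTable` helper are hypothetical):
   
   ```java
   import java.util.List;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;
   
   public class ParallelHudiUpserts {
     public static void main(String[] args) throws InterruptedException {
       List<String> tables = List.of("table_01", "table_02", /* ... */ "table_17");
       ExecutorService pool = Executors.newFixedThreadPool(2); // 2 tables at a time
       for (String table : tables) {
         // Each task upserts one table; when a task finishes, its thread is
         // freed and the pool picks up the next queued table.
         pool.submit(() -> upsertTable(table));
       }
       pool.shutdown(); // the job ends once all 17 upserts have finished
       pool.awaitTermination(1, TimeUnit.HOURS);
     }
   
     private static void upsertTable(String table) {
       // hypothetical: run the Hudi upsert for this table
     }
   }
   ```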





[jira] [Created] (HUDI-6634) Add support for schemaProvider in CloudObjectsSelectorCommon

2023-08-02 Thread Harshal Patil (Jira)
Harshal Patil created HUDI-6634:
---

 Summary: Add support for  schemaProvider in 
CloudObjectsSelectorCommon
 Key: HUDI-6634
 URL: https://issues.apache.org/jira/browse/HUDI-6634
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Harshal Patil


There should be a way to provide a schema while loading files from CloudObjects.





[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663324203

   
   ## CI report:
   
   * d2360a5a7de655991202680013d20268ce325666 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19016)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9209:
URL: https://github.com/apache/hudi/pull/9209#issuecomment-1663323825

   
   ## CI report:
   
   * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN
   * 4ade37c10c908c0422915aaa489208e6ee62bb0d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18997)
 
   * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


danny0405 commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663319058

   @voonhous It would be great if you could give this issue higher priority; there are only 2 days left until the 0.14.0 release code freeze.





[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9330:
URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663317943

   
   ## CI report:
   
   * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990)
 
   * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
   * d6d32a693c455830a31b883915e9940fa309c77f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9209:
URL: https://github.com/apache/hudi/pull/9209#issuecomment-1663317607

   
   ## CI report:
   
   * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN
   * 4ade37c10c908c0422915aaa489208e6ee62bb0d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18997)
 
   * 57c1b843608a9b63d143ead5dd5168613bb13969 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-02 Thread via GitHub


danny0405 commented on code in PR #9330:
URL: https://github.com/apache/hudi/pull/9330#discussion_r1282663938


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -690,7 +690,7 @@ private static HoodieTableMetaClient 
newMetaClient(Configuration conf, String ba
 ? (HoodieTableMetaClient) 
ReflectionUtils.loadClass("org.apache.hudi.common.table.HoodieTableMetaserverClient",
 new Class[]{Configuration.class, String.class, 
ConsistencyGuardConfig.class, String.class, FileSystemRetryConfig.class, 
String.class, String.class, HoodieMetaserverConfig.class},
 conf, basePath, consistencyGuardConfig, recordMergerStrategy, 
fileSystemRetryConfig,
-metaserverConfig.getDatabaseName(), metaserverConfig.getTableName(), 
metaserverConfig)
+Option.of(metaserverConfig.getDatabaseName()), 
Option.of(metaserverConfig.getTableName()), metaserverConfig)
 : new HoodieTableMetaClient(conf, basePath,

Review Comment:
   How could the option be empty? Maybe you should use `Option.ofNullable`
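   
   For context, a minimal sketch of the difference, assuming Hudi's `Option` mirrors `java.util.Optional` semantics here (the `dbName` value is illustrative):
   
   ```java
   import org.apache.hudi.common.util.Option;
   
   public class OptionDemo {
     public static void main(String[] args) {
       String dbName = null; // e.g. a config value that was never set
       Option<String> ok = Option.ofNullable(dbName); // yields an empty Option
       System.out.println(ok.isPresent()); // false
       Option<String> boom = Option.of(dbName); // assumed to reject null, like java.util.Optional.of
     }
   }
   ```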






[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9330:
URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663312118

   
   ## CI report:
   
   * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990)
 
   * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
   * d6d32a693c455830a31b883915e9940fa309c77f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9324:
URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663312046

   
   ## CI report:
   
   * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9330:
URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663279883

   
   ## CI report:
   
   * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990)
 
   * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1663274558

   
   ## CI report:
   
   * 5b6c8a9f7e241fb76bc7112881e0a9cbbeb07a12 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19012)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


big-doudou commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663269690

   > I think he means check why #finalizeWrite is not picking up the files to be deleted upon commit?
   
   It would be great if there were a lighter solution; otherwise my task still needs to be rolled back.





[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


big-doudou commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663268905

   > I think he means check why #finalizeWrite is not picking up the files to be deleted upon commit?
   
   Yes, because those files are not visible to #getLatestFileSlices.





[GitHub] [hudi] voonhous commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


voonhous commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663266922

   I think he means check why #finalizeWrite is not picking up the files to be deleted upon commit?





[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Sagar Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17750563#comment-17750563
 ] 

Sagar Sumit commented on HUDI-6596:
---

[~krishen] Overall, your proposed approach seems robust and thoughtful. A few 
considerations:

> Acquire the table lock

The table lock could become a bottleneck, potentially leading to performance 
issues as other operations might be blocked too. It might be useful to consider 
how frequently you expect concurrent rollbacks to occur and whether this might 
create a performance problem.

> check for an active heartbeat for the rollback instant time. If there is one, 
> then abort the rollback as that means there is a concurrent job executing 
> that rollback.

It's worth considering edge cases where heartbeats could become stale or be missed (e.g., if a job crashes without properly closing its heartbeat). Handling these scenarios gracefully will help ensure that rollbacks can still proceed when needed. Can we ensure rollbacks are idempotent in the case of repeated failures or retries?
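
For illustration, a compact Java sketch of the guarded flow being proposed (the lock and heartbeat helpers below are simplified in-memory stand-ins, not Hudi's actual lock provider or heartbeat client):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-ins: a real rollback would go through the table lock
// provider and the heartbeat client, not an in-process lock and set.
class GuardedRollback {
  private final ReentrantLock tableLock = new ReentrantLock();
  private final Set<String> activeHeartbeats = ConcurrentHashMap.newKeySet();

  void rollback(String rollbackInstant) {
    tableLock.lock();                    // 1. acquire the table lock
    try {
      reloadActiveTimeline();            // 2. reload the active timeline
      resolvePendingPlanOrScheduleNew(); // 3. reuse a pending plan or schedule one
      // 4. an active heartbeat means a concurrent job owns this rollback
      if (!activeHeartbeats.add(rollbackInstant)) {
        throw new IllegalStateException("Concurrent rollback in progress: " + rollbackInstant);
      }
    } finally {
      tableLock.unlock();                // 5. release the table lock
    }
    try {
      executeRollbackPlan();             // 6. execute outside the lock
    } finally {
      activeHeartbeats.remove(rollbackInstant); // close the heartbeat on success or failure
    }
  }

  private void reloadActiveTimeline() {}
  private void resolvePendingPlanOrScheduleNew() {}
  private void executeRollbackPlan() {}
}
```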

>  Propose rollback implementation changes to guard against concurrent jobs
> -
>
> Key: HUDI-6596
> URL: https://issues.apache.org/jira/browse/HUDI-6596
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: Krishen Bhan
>Priority: Trivial
>
> h1. Issue
> The existing rollback API in 0.14 
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
>  executes a rollback plan, either taking in an existing rollback plan 
> provided by the caller for a previous rollback attempt, or scheduling a 
> new rollback instant if none is provided. Currently it is not safe for two 
> concurrent jobs to call this API (when skipLocking=False and the callers 
> aren't already holding a lock), as this can lead to an issue where multiple 
> rollback requested plans are created or two jobs are executing the same 
> rollback instant at the same time.
> h1. Proposed change
> One way to resolve this issue is to refactor this rollback function such that 
> if skipLocking=false, the following steps are followed
>  # Acquire the table lock
>  # Reload the active timeline
>  # Look at the active timeline to see if there is an inflight rollback instant 
> from a previous rollback attempt; if it exists, then assign this as the 
> rollback plan to execute. Also, check if a pending rollback plan was passed 
> in by caller. Then it executes the following steps depending on whether the 
> caller passed a pending rollback instant plan.
>  ##  [a] If a pending inflight rollback plan was passed in by caller, then 
> check that there is a previous attempted rollback instant on timeline (and 
> that the instant times match) and continue to use this rollback plan. If that 
> isn't the case, then raise a rollback exception since this means another job 
> has concurrently already executed this plan. Note that in a valid HUDI 
> dataset there can be at most one rollback instant for a corresponding commit 
> instant, which is why if we no longer see a pending rollback in timeline in 
> this phase we can safely assume that it had already been executed to 
> completion.
>  ##  [b] If no pending inflight rollback plan was passed in by caller and no 
> pending rollback instant was found in timeline earlier, then schedule a new 
> rollback plan
>  # Now that a rollback plan and requested rollback instant time has been 
> assigned, check for an active heartbeat for the rollback instant time. If 
> there is one, then abort the rollback as that means there is a concurrent job 
> executing that rollback. If not, then start a heartbeat for that rollback 
> instant time.
>  # Release the table lock
>  # Execute the rollback plan and complete the rollback instant. Regardless of 
> whether this succeeds or fails with an exception, close the heartbeat. This 
> increases the chance that the next job that tries to call this rollback API 
> will follow through with the rollback and not abort due to an active previous 
> heartbeat
>  
>  * These steps will only be enforced for skipLocking=false, since if 
> skipLocking=true then that means the caller may already be explicitly holding 
> a table lock. In this case, acquiring the lock again in step (1) will fail.
>  * Acquiring a lock and reloading timeline for (1-3) will guard against data 
> race conditions where another job calls this rollback API at same time and 
> schedules its own rollback plan and instant. This is since if no rollback has 
> been attempted before for this instant, then before step (1), there is a 
> window of time where another concurrent rollback job could have scheduled a 
> rollback plan, failed execution, and cleaned up heartbeat, all while 

[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


big-doudou commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663264488

   https://github.com/apache/hudi/pull/9182   You can read danny0405's reply. He said that there will be another bootstrap for the rollback. I haven't had time to test the details; I will check this issue in detail next week.





[GitHub] [hudi] voonhous commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


voonhous commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663258600

   Spent 2 more hours looking at this issue:
   
   What happened was that I was testing this on 0.12.1 without this PR: https://github.com/apache/hudi/pull/7208
   
   To reproduce this error, add the snippet below into `org.apache.hudi.sink.StreamWriteFunction#flushRemaining`:
   
   ```java
   if (taskID == 0) {
     // trigger a failure
     throw new HoodieException("Intentional failure on taskID 0 thrown to invoke partial failover?");
   }
   ```
   
   Prior to this enhancement, rollbacks would be created whenever a TM failed, to remove all the partially written files.
   
   However, after this enhancement, rollbacks will not be created unless the job is restarted or a global failover happens.





[hudi] branch master updated: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen (#9273)

2023-08-02 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d779fb5aa1 [HUDI-6320] Fix partition parsing in Spark file index for 
custom keygen (#9273)
2d779fb5aa1 is described below

commit 2d779fb5aa1ebfd33676ebf29217f25c60e17d12
Author: Sagar Sumit 
AuthorDate: Thu Aug 3 09:17:38 2023 +0530

[HUDI-6320] Fix partition parsing in Spark file index for custom keygen 
(#9273)
---
 .../scala/org/apache/hudi/HoodieFileIndex.scala| 14 -
 .../apache/hudi/SparkHoodieTableFileIndex.scala| 13 ++--
 .../scala/org/apache/hudi/cdc/HoodieCDCRDD.scala   |  2 +-
 .../org/apache/hudi/TestHoodieFileIndex.scala  | 34 ---
 .../apache/hudi/functional/TestCOWDataSource.scala | 69 +-
 5 files changed, 99 insertions(+), 33 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
index 3767b65a8ce..a7e90b2fe50 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
@@ -79,7 +79,7 @@ case class HoodieFileIndex(spark: SparkSession,
 spark = spark,
 metaClient = metaClient,
 schemaSpec = schemaSpec,
-configProperties = getConfigProperties(spark, options),
+configProperties = getConfigProperties(spark, options, metaClient),
 queryPaths = HoodieFileIndex.getQueryPaths(options),
 specifiedQueryInstant = 
options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant),
 fileStatusCache = fileStatusCache
@@ -324,7 +324,7 @@ object HoodieFileIndex extends Logging {
 schema.fieldNames.filter { colName => refs.exists(r => 
resolver.apply(colName, r.name)) }
   }
 
-  def getConfigProperties(spark: SparkSession, options: Map[String, String]) = 
{
+  def getConfigProperties(spark: SparkSession, options: Map[String, String], 
metaClient: HoodieTableMetaClient) = {
 val sqlConf: SQLConf = spark.sessionState.conf
 val properties = TypedProperties.fromMap(options.filter(p => p._2 != 
null).asJava)
 
@@ -342,6 +342,16 @@ object HoodieFileIndex extends Logging {
 if (listingModeOverride != null) {
   
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
 listingModeOverride)
 }
+val partitionColumns = metaClient.getTableConfig.getPartitionFields
+if (partitionColumns.isPresent) {
+  // NOTE: Multiple partition fields could have non-encoded slashes in the 
partition value.
+  //   We might not be able to properly parse partition-values from 
the listed partition-paths.
+  //   Fallback to eager listing in this case.
+  if (partitionColumns.get().length > 1
+&& (listingModeOverride == null || 
DataSourceReadOptions.FILE_INDEX_LISTING_MODE_LAZY.equals(listingModeOverride)))
 {
+
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
 DataSourceReadOptions.FILE_INDEX_LISTING_MODE_EAGER)
+  }
+}
 
 properties
   }
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
index 35ef3e9f066..b3d9e5659e8 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
@@ -29,11 +29,9 @@ import org.apache.hudi.common.model.{FileSlice, 
HoodieTableQueryType}
 import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY
-import org.apache.hudi.hadoop.CachingPath
-import org.apache.hudi.hadoop.CachingPath.createRelativePathUnsafe
 import org.apache.hudi.internal.schema.Types.RecordType
 import org.apache.hudi.internal.schema.utils.Conversions
-import org.apache.hudi.keygen.{StringPartitionPathFormatter, 
TimestampBasedAvroKeyGenerator, TimestampBasedKeyGenerator}
+import org.apache.hudi.keygen.{CustomAvroKeyGenerator, CustomKeyGenerator, 
StringPartitionPathFormatter, TimestampBasedAvroKeyGenerator, 
TimestampBasedKeyGenerator}
 import org.apache.hudi.util.JFunction
 import org.apache.spark.api.java.JavaSparkContext
 import org.apache.spark.internal.Logging
@@ -44,7 +42,6 @@ import org.apache.spark.sql.catalyst.{InternalRow, 
expressions}
 import org.apache.spark.sql.execution.datasources.{FileStatusCache, NoopCache}
 import 

[GitHub] [hudi] codope merged pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen

2023-08-02 Thread via GitHub


codope merged PR #9273:
URL: https://github.com/apache/hudi/pull/9273





[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9327:
URL: https://github.com/apache/hudi/pull/9327#issuecomment-1663249477

   
   ## CI report:
   
   * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN
   * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003)
 
   * 9f6586fa89ccbb464f282c46df781f0280a14762 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19024)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9327:
URL: https://github.com/apache/hudi/pull/9327#issuecomment-1663245042

   
   ## CI report:
   
   * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN
   * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003)
 
   * 9f6586fa89ccbb464f282c46df781f0280a14762 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9273:
URL: https://github.com/apache/hudi/pull/9273#issuecomment-1663244910

   
   ## CI report:
   
   * 3b54d26d8787cdb0cc1bccd86bcaa2e40b3d94a7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18981)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9350: [HUDI-2141] Support flink read metrics

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9350:
URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663240644

   
   ## CI report:
   
   * f36281ccc97ad7a566fd73ddc40543e573ce68b0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19022)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9273:
URL: https://github.com/apache/hudi/pull/9273#issuecomment-1663240436

   
   ## CI report:
   
   * 3b54d26d8787cdb0cc1bccd86bcaa2e40b3d94a7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] zbbkeepgoing opened a new issue, #9351: [SUPPORT] The point query performance after clustering lags behind Delta Lake.

2023-08-02 Thread via GitHub


zbbkeepgoing opened a new issue, #9351:
URL: https://github.com/apache/hudi/issues/9351

   **Describe the problem you faced**
   
   - Our scenario
   
   We have 700 million records in our original offline table, distributed 
across 10 partitions. Each partition has a different data size, ranging from 
10GB to 200GB.  We plan to ingest this data into a data lake and test the point 
query performance after applying Clustering.
   
   - Point query scenario
   
   The original table has a column called "vin", which is used as a filter along with the time partition column for point queries (see the sketch after the configuration below).
   
   - Hudi configuration
   
   `hoodie.clustering.plan.strategy.target.file.max.bytes` is set to 1 GB, consistent with Delta Lake's default value.
   
   ```
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   hoodie.clustering.plan.strategy.sort.columns=vin
   hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
   hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
   hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
   hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
   hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
   hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
   hoodie.clustering.plan.strategy.max.num.groups=128
   hoodie.layout.optimize.enable=true
   hoodie.layout.optimize.strategy=z-order
   ```
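   
   For reference, a sketch of the point-query pattern described above (the data-skipping options are assumptions about the read path rather than settings taken from this issue, and the `vin` value is made up):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class PointQuerySketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().getOrCreate();
       Dataset<Row> hits = spark.read().format("hudi")
           .option("hoodie.metadata.enable", "true")      // assumption: metadata table enabled
           .option("hoodie.enable.data.skipping", "true") // assumption: prune files via column stats
           .load("/path/to/hudi_table")                   // hypothetical base path
           .where("part_dt = '20230614' AND vin = 'VIN0001'"); // partition column + sort column
       hits.show();
     }
   }
   ```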
   
   - Phenomena we observed
   
   1. After Clustering, both Hudi and Delta Lake produce Parquet files of 
approximately 1GB, with an error margin of around 200MB.
   
   2. With Clustering applied, when performing point queries, Hudi scans around 
10 files in partitions with larger data, while Delta Lake typically scans only 
1-2 files regardless of the partition.
   
   3. We conducted performance tests with 10 concurrent and 1 concurrent 
queries. We ran hundreds of rounds of tests on both Hudi and Delta Lake, with 
different combinations of "vin" and time partition columns. The final 
conclusion was that Delta Lake performs three times better than Hudi.
   
   After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to retrieve candidate files. Therefore, we believe the file-listing logic itself is unlikely to be the cause of the performance lag; it is highly likely that the issue lies in the clustering algorithm itself.
   
   Could you please analyze, from a professional perspective, the reason behind this? The answer determines which data lake technology we ultimately choose.
   
   **Expected behavior**
   
   The point query performance after clustering is comparable to Delta Lake.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.3
   
   * Hive version :  2.3.9
   
   * Hadoop version :  2.x
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   





[GitHub] [hudi] eric9204 commented on a diff in pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-02 Thread via GitHub


eric9204 commented on code in PR #9327:
URL: https://github.com/apache/hudi/pull/9327#discussion_r1282589478


##
hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieRecordDelegate.java:
##
@@ -70,4 +78,24 @@ public void testKryoSerializeDeserialize() {
 assertEquals(new HoodieRecordLocation("001", "file01"), 
hoodieRecordDelegate.getCurrentLocation().get());
 assertEquals(new HoodieRecordLocation("001", "file-01"), 
hoodieRecordDelegate.getNewLocation().get());
   }
+
+  public Kryo getKryoInstance() {
+final Kryo kryo = new Kryo();
+// This instance of Kryo should not require prior registration of classes
+kryo.setRegistrationRequired(false);
+kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
+// Handle cases where we may have an odd classloader setup like with 
libjars
+// for hadoop
+kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
+
+// Register Hudi's classes
+new HoodieCommonKryoRegistrar().registerClasses(kryo);
+
+// Register serializers
+kryo.register(Utf8.class, new SerializationUtils.AvroUtf8Serializer());
+kryo.register(GenericData.Fixed.class, new GenericAvroSerializer<>());

Review Comment:
   No, the member variable types of `HoodieRecordDelegate` don't contain Avro types, so those serializers shouldn't be registered. This has been updated.
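   
   For reference, a minimal round trip of the kind this test exercises, using Kryo's standard `Output`/`Input` API (a sketch that assumes a `kryo` instance from `getKryoInstance()` above and a populated `hoodieRecordDelegate`; it is not the test's exact code):
   
   ```java
   import com.esotericsoftware.kryo.io.Input;
   import com.esotericsoftware.kryo.io.Output;
   import java.io.ByteArrayOutputStream;
   
   // Serialize: Kryo invokes write(Kryo, Output) on KryoSerializable objects.
   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   Output output = new Output(baos);
   kryo.writeObject(output, hoodieRecordDelegate);
   output.close();
   
   // Deserialize and compare against the original in assertions.
   Input input = new Input(baos.toByteArray());
   HoodieRecordDelegate copy = kryo.readObject(input, HoodieRecordDelegate.class);
   input.close();
   ```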






[GitHub] [hudi] danny0405 commented on a diff in pull request #9350: [HUDI-2141] Support flink read metrics

2023-08-02 Thread via GitHub


danny0405 commented on code in PR #9350:
URL: https://github.com/apache/hudi/pull/9350#discussion_r1282587487


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:
##
@@ -262,9 +268,16 @@ public void snapshotState(FunctionSnapshotContext context) 
throws Exception {
 this.instantState.clear();
 if (this.issuedInstant != null) {
   this.instantState.add(this.issuedInstant);
+  this.readMetrics.setIssuedInstant(this.issuedInstant);
 }
 if (this.issuedOffset != null) {

Review Comment:
   Do the metrics get updated for each read?



##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadOperator.java:
##
@@ -168,6 +174,8 @@ private void processSplits() throws IOException {
   currentSplitState = SplitState.IDLE;
 }
 
+readMetrics.setSplitLatestCommit(split.getLatestCommit());
+

Review Comment:
   Do the metrics get updated for each read?
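   
   For context, Flink metrics of this kind are usually registered once as gauges, so each read only needs to update a backing field that the reporter polls (a sketch using Flink's standard metric-group API; the class and metric names are illustrative):
   
   ```java
   import org.apache.flink.metrics.Gauge;
   import org.apache.flink.metrics.MetricGroup;
   
   public class StreamReadMetricsSketch {
     private volatile String splitLatestCommit = "";
   
     // Called once, e.g. from the operator's open(); after that, every read
     // just assigns the field and the reporter observes it via the gauge.
     public void register(MetricGroup metricGroup) {
       metricGroup.gauge("splitLatestCommit", (Gauge<String>) () -> splitLatestCommit);
     }
   
     public void setSplitLatestCommit(String latestCommit) {
       this.splitLatestCommit = latestCommit;
     }
   }
   ```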






[hudi] branch master updated: [MINOR] Pass prepped boolean correctly in sql writer (#9320)

2023-08-02 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 62a9279d666 [MINOR] Pass prepped boolean correctly in sql writer 
(#9320)
62a9279d666 is described below

commit 62a9279d46fd7abe1872857ea2f94fdedd46
Author: Sagar Sumit 
AuthorDate: Thu Aug 3 08:22:59 2023 +0530

[MINOR] Pass prepped boolean correctly in sql writer (#9320)
---
 .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala |  3 +--
 .../sql/hudi/command/MergeIntoHoodieTableCommand.scala   | 16 
 .../hudi/TestMergeIntoTableWithNonRecordKeyField.scala   |  3 ---
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index fcee3fdab49..07b16e1e47d 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -404,8 +404,7 @@ object HoodieSparkSqlWriter {
 hoodieRecords
   }
 client.startCommitWithTime(instantTime, commitActionType)
-val writeResult = DataSourceUtils.doWriteOperation(client, 
dedupedHoodieRecords, instantTime, operation,
-  isPrepped)
+val writeResult = DataSourceUtils.doWriteOperation(client, 
dedupedHoodieRecords, instantTime, operation, isPrepped)
 (writeResult, client)
 }
 
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
index eba75c95452..f830c552bc8 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala
@@ -24,8 +24,8 @@ import 
org.apache.hudi.HoodieSparkSqlWriter.CANONICALIZE_NULLABLE
 import org.apache.hudi.avro.HoodieAvroUtils
 import org.apache.hudi.common.model.HoodieAvroRecordMerger
 import org.apache.hudi.common.util.StringUtils
-import org.apache.hudi.config.HoodieWriteConfig.{AVRO_SCHEMA_VALIDATE_ENABLE, 
SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP, TBL_NAME}
 import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.config.HoodieWriteConfig.{AVRO_SCHEMA_VALIDATE_ENABLE, 
SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP, TBL_NAME}
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.hive.HiveSyncConfigHolder
 import org.apache.hudi.sync.common.HoodieSyncConfig
@@ -342,7 +342,9 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
 val tableMetaCols = mergeInto.targetTable.output.filter(a => 
isMetaField(a.name))
 val joinData = 
sparkAdapter.getCatalystPlanUtils.createMITJoin(mergeInto.sourceTable, 
mergeInto.targetTable, LeftOuter, Some(mergeInto.mergeCondition), "NONE")
 val incomingDataCols = 
joinData.output.filterNot(mergeInto.targetTable.outputSet.contains)
-val projectedJoinPlan = if 
(sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), 
SPARK_SQL_OPTIMIZED_WRITES.defaultValue()) == "true") {
+// for pkless table, we need to project the meta columns
+val hasPrimaryKey = 
hoodieCatalogTable.tableConfig.getRecordKeyFields.isPresent
+val projectedJoinPlan = if (!hasPrimaryKey || 
sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), 
"false") == "true") {
   Project(tableMetaCols ++ incomingDataCols, joinData)
 } else {
   Project(incomingDataCols, joinData)
@@ -619,12 +621,10 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
 // default value ("ts")
 // TODO(HUDI-3456) clean up
 val preCombineField = hoodieCatalogTable.preCombineKey.getOrElse("")
-
 val hiveSyncConfig = buildHiveSyncConfig(sparkSession, hoodieCatalogTable, 
tableConfig)
-
-val enableOptimizedMerge = 
sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(),
-  SPARK_SQL_OPTIMIZED_WRITES.defaultValue())
-
+// for pkless tables, we need to enable optimized merge
+val hasPrimaryKey = tableConfig.getRecordKeyFields.isPresent
+val enableOptimizedMerge = if (!hasPrimaryKey) "true" else 
sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), 
"false")
 val keyGeneratorClassName = if (enableOptimizedMerge == "true") {
   classOf[MergeIntoKeyGenerator].getCanonicalName
 } else {
@@ -653,7 +653,7 @@ case class 

[GitHub] [hudi] codope merged pull request #9320: [MINOR] Infer prepped boolean correctly and disable prepped write for MergeInto

2023-08-02 Thread via GitHub


codope merged PR #9320:
URL: https://github.com/apache/hudi/pull/9320





[hudi] branch master updated: [HUDI-6569] Fix write failure for Avro Enum type (#9237)

2023-08-02 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4017b96fb1b [HUDI-6569] Fix write failure for Avro Enum type (#9237)
4017b96fb1b is described below

commit 4017b96fb1bf47283d0d16deea28fb5dc806d8eb
Author: Y Ethan Guo 
AuthorDate: Wed Aug 2 19:52:02 2023 -0700

[HUDI-6569] Fix write failure for Avro Enum type (#9237)

- Fix a regression for Avro ENUM type.
- Adds logic to handle ENUM type in 
`HoodieAvroUtils.rewriteRecordWithNewSchemaInternal` and `AvroDeserializer`.
---
 .../commit/TestJavaCopyOnWriteActionExecutor.java  |   3 +-
 .../src/test/resources/testDataGeneratorSchema.txt | 132 
 .../commit/TestCopyOnWriteActionExecutor.java  |   3 +-
 .../GenericRecordValidationTestUtils.java  |   5 +
 .../src/test/resources/testDataGeneratorSchema.txt | 132 
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  18 +-
 .../common/testutils/HoodieTestDataGenerator.java  |   9 +-
 .../apache/hudi/common/util/TestAvroOrcUtils.java  |   5 +-
 .../apache/spark/sql/avro/AvroDeserializer.scala   |   1 +
 .../apache/spark/sql/avro/AvroDeserializer.scala   |   1 +
 .../apache/spark/sql/avro/AvroDeserializer.scala   |   1 +
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  15 +-
 .../apache/spark/sql/avro/AvroDeserializer.scala   |  13 +-
 .../apache/spark/sql/avro/AvroDeserializer.scala   |   1 +
 .../streamer-config/source-flattened.avsc  | 101 +
 .../src/test/resources/streamer-config/source.avsc | 228 +++-
 .../resources/streamer-config/source_evolved.avsc  |   4 +
 .../source_evolved_post_processed.avsc |   4 +
 .../streamer-config/sql-transformer.properties |   2 +-
 .../streamer-config/target-flattened.avsc  | 108 ++
 .../src/test/resources/streamer-config/target.avsc | 235 -
 21 files changed, 451 insertions(+), 570 deletions(-)

diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
index a272585b360..f57b21d89be 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
@@ -408,11 +408,10 @@ public class TestJavaCopyOnWriteActionExecutor extends 
HoodieJavaClientTestHarne
 
   @Test
   public void testInsertUpsertWithHoodieAvroPayload() throws Exception {
-Schema schema = 
getSchemaFromResource(TestJavaCopyOnWriteActionExecutor.class, 
"/testDataGeneratorSchema.txt");
 HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
 .withEngineType(EngineType.JAVA)
 .withPath(basePath)
-.withSchema(schema.toString())
+.withSchema(TRIP_EXAMPLE_SCHEMA)
 .withStorageConfig(HoodieStorageConfig.newBuilder()
 .parquetMaxFileSize(1000 * 1024).hfileMaxFileSize(1000 * 
1024).build())
 .build();
diff --git 
a/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt 
b/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt
deleted file mode 100644
index c80365b76ea..000
--- 
a/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt
+++ /dev/null
@@ -1,132 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-{
-  "type" : "record",
-  "name" : "triprec",
-  "fields" : [
-  {
-"name" : "timestamp",
-"type" : "long"
-  }, {
-"name" : "_row_key",
-"type" : "string"
-  }, {
- "name" : "partition_path",
- "type" : ["null", "string"],
- "default": null
-  }, {
-"name" : "rider",
-"type" : "string"
-  }, {
-"name" : "driver",
-"type" : "string"
-  }, {
-"name" : "begin_lat",
-"type" : "double"
-  }, {
-"name" : "begin_lon",
-"type" : "double"
-  }, {
-   

[GitHub] [hudi] codope merged pull request #9237: [HUDI-6569] Fix write failure for Avro Enum type

2023-08-02 Thread via GitHub


codope merged PR #9237:
URL: https://github.com/apache/hudi/pull/9237





[GitHub] [hudi] hudi-bot commented on pull request #9350: [HUDI-2141] Support flink read metrics

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9350:
URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663214407

   
   ## CI report:
   
   * f36281ccc97ad7a566fd73ddc40543e573ce68b0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xuzifu666 closed pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-02 Thread via GitHub


xuzifu666 closed pull request #9349: [MINOR] JSR dependency not used in 
spark3.3 version
URL: https://github.com/apache/hudi/pull/9349





[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663209721

   
   ## CI report:
   
   * 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19021)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9336: [HUDI-6629] - Changes for s3/gcs IncrSource job to taken into sourceLimit during ingestion

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9336:
URL: https://github.com/apache/hudi/pull/9336#issuecomment-1663209678

   
   ## CI report:
   
   * 77d7b455ee5cd668a005f6f7e6f04135608f2b7a UNKNOWN
   * 1af3c1cd31e9ec695e98e8c2f58cb6ed03ce6dc4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19009)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] stream2000 commented on pull request #9350: Support flink read metrics

2023-08-02 Thread via GitHub


stream2000 commented on PR #9350:
URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663209301

   @danny0405 Could you help review this pr?





[GitHub] [hudi] stream2000 opened a new pull request, #9350: Support flink read metrics

2023-08-02 Thread via GitHub


stream2000 opened a new pull request, #9350:
URL: https://github.com/apache/hudi/pull/9350

   ### Change Logs
   
   Subtask of HUDI-2141: support Flink read metrics.
   
   For stream write metrics and compaction metrics, see #9118.
   
   ### Impact
   
   Adds some metrics.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   Will update the documentation after merge.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] danny0405 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS

2023-08-02 Thread via GitHub


danny0405 commented on issue #8848:
URL: https://github.com/apache/hudi/issues/8848#issuecomment-1663205818

   Yeah, maybe it's my fault: we do not exclude Calcite when packaging the bundle with hive-exec. For some Hive versions since 3.x, the Calcite-related classes are required, but hive-exec itself does not include Calcite. Do you package using the same version of hive-exec as your Hive server?





[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663204878

   
   ## CI report:
   
   * 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9337:
URL: https://github.com/apache/hudi/pull/9337#issuecomment-1663204832

   
   ## CI report:
   
   * 9cbb48c5cad3d7b467a05eee5a692900539ed863 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19010)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9237: [HUDI-6569] Fix write failure for Avro Enum type

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9237:
URL: https://github.com/apache/hudi/pull/9237#issuecomment-1663204597

   
   ## CI report:
   
   * a30830bcec5f907c190d3349be68297f72a158c1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18988)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils

2023-08-02 Thread via GitHub


xuzifu666 commented on PR #9001:
URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663203677

   > > what is your hudi version?
   > 
   > @xuzifu666, I am building hudi from the master branch.
   > 
   > > and from the stack, and jar submit,it maybe your user jar contains jsr 
depency and version is too low
   > 
   > I don't have a user jar. Everything here is hudi codebase. I am just 
trying to run the integration tests from command line. The only dependency I 
see is on jackson 2.10
   > 
   > `mvn clean dependency:tree -Dincludes=com.fasterxml.jackson.datatype 
-Pintegration-tests`
   > 
   > This has to do something with the runtime setup. Note the package name in 
`NoClassDefFoundError` message. It is looking for `JavaTimeModule` in the wrong 
package somehow:
   > 
   > `java.lang.NoClassDefFoundError: 
org/apache/hudi/com/fasterxml/jackson/datatype/jsr310/JavaTimeModule`
   Can this PR resolve your problem? https://github.com/apache/hudi/pull/9349 
@amrishlal 
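   
   One way to see whether the relocated class actually made it into the runtime 
classpath (a diagnostic sketch; the shaded package name is taken verbatim from 
the `NoClassDefFoundError` above):
   
   ```scala
   // Probe for the relocated (shaded) JavaTimeModule class from the error above.
   try {
     Class.forName("org.apache.hudi.com.fasterxml.jackson.datatype.jsr310.JavaTimeModule")
     println("shaded JavaTimeModule found")
   } catch {
     case _: ClassNotFoundException =>
       println("shaded JavaTimeModule missing: the bundle likely did not shade jackson-datatype-jsr310")
   }
   ```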
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer

2023-08-02 Thread via GitHub


danny0405 commented on PR #9199:
URL: https://github.com/apache/hudi/pull/9199#issuecomment-1663203592

   @leesf, is it good to land now? We still have 2 days before the 0.14.0 code 
freeze.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] SteNicholas commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table

2023-08-02 Thread via GitHub


SteNicholas commented on PR #9287:
URL: https://github.com/apache/hudi/pull/9287#issuecomment-1663192615

   @danny0405, the current behavior and config are consistent with Spark insert 
overwrite. PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-02 Thread via GitHub


xuzifu666 commented on PR #9349:
URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663176001

   cc @xushiyan, please have a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 opened a new pull request, #9349: [MINOR] JSR dependency not used in spark3.3 version

2023-08-02 Thread via GitHub


xuzifu666 opened a new pull request, #9349:
URL: https://github.com/apache/hudi/pull/9349

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   The JSR dependency is not used in the Spark 3.3 version.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   none
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table

2023-08-02 Thread via GitHub


danny0405 commented on PR #7580:
URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663174983

   > have you done some work on this
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table

2023-08-02 Thread via GitHub


Zouxxyy commented on PR #7580:
URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663173182

   @danny0405
   > That's true, we should optimize the archiving of cleaning and rollback.
   
   I see you are working on the LSM-tree-based archived timeline; have you done 
some work on this?
   If not, I'd like to work on it. The current process of 
`getInstantsToArchive` is a bit complicated, so I will sort it out as a whole.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1663165947

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013)
 
   * 293ae466c121508e2e1d0b32c384c99ea1eea707 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19018)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6627) Spark write client fails when write schema is null

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6627:
-
Fix Version/s: 0.14.0

> Spark write client fails when write schema is null
> --
>
> Key: HUDI-6627
> URL: https://issues.apache.org/jira/browse/HUDI-6627
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinish Reddy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the source returns an empty option in DeltaStreamer, the writer schema is 
> null. This causes an NPE in the table schema validation in the Spark write 
> client, producing the exception below. We should skip this validation when the 
> writer schema is null. 
> {code:java}
> org.apache.hudi.exception.HoodieInsertException: Failed insert schema 
> compability check.
>   at 
> org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:851)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:185)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:690)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:396)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.ingestOnce(HoodieDeltaStreamer.java:876)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at 
> com.onehouse.hudi.OnehouseDeltaStreamer$MultiTableSyncService.lambda$null$1(OnehouseDeltaStreamer.java:319)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to read 
> schema/check compatibility for base path 
> s3a://onehouse-customer-bucket-2451e78f/data-lake/chandra_data_lake_default/xml_flatten_struct_test
>   at 
> org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:830)
>   at 
> org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:849)
>   ... 10 more
> Caused by: java.lang.NullPointerException
>   at 
> com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1158)
>   at org.apache.avro.Schema$Parser.parse(Schema.java:1418)
>   at 
> org.apache.hudi.avro.HoodieAvroUtils.createHoodieWriteSchema(HoodieAvroUtils.java:302)
>   at 
> org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:826)
>   ... 11 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6627) Spark write client fails when write schema is null

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6627.

Resolution: Fixed

Fixed via master branch: 95d0fb5d3276936a3638baed31edc4d9fe0d1f34

> Spark write client fails when write schema is null
> --
>
> Key: HUDI-6627
> URL: https://issues.apache.org/jira/browse/HUDI-6627
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinish Reddy
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the source returns an empty option in DeltaStreamer, the writer schema is 
> null. This causes an NPE in the table schema validation in the Spark write 
> client, producing the exception below. We should skip this validation when the 
> writer schema is null. 
> {code:java}
> org.apache.hudi.exception.HoodieInsertException: Failed insert schema 
> compability check.
>   at 
> org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:851)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:185)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:690)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:396)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.ingestOnce(HoodieDeltaStreamer.java:876)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>   at 
> com.onehouse.hudi.OnehouseDeltaStreamer$MultiTableSyncService.lambda$null$1(OnehouseDeltaStreamer.java:319)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to read 
> schema/check compatibility for base path 
> s3a://onehouse-customer-bucket-2451e78f/data-lake/chandra_data_lake_default/xml_flatten_struct_test
>   at 
> org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:830)
>   at 
> org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:849)
>   ... 10 more
> Caused by: java.lang.NullPointerException
>   at 
> com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1158)
>   at org.apache.avro.Schema$Parser.parse(Schema.java:1418)
>   at 
> org.apache.hudi.avro.HoodieAvroUtils.createHoodieWriteSchema(HoodieAvroUtils.java:302)
>   at 
> org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:826)
>   ... 11 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6627] Fix NPE when spark client writer schema is null (#9335)

2023-08-02 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 95d0fb5d327 [HUDI-6627] Fix NPE when spark client writer schema is 
null (#9335)
95d0fb5d327 is described below

commit 95d0fb5d3276936a3638baed31edc4d9fe0d1f34
Author: Vinish Reddy 
AuthorDate: Thu Aug 3 06:39:13 2023 +0530

[HUDI-6627] Fix NPE when spark client writer schema is null (#9335)
---
 .../java/org/apache/hudi/table/HoodieTable.java|  5 +-
 .../hudi/testutils/HoodieClientTestBase.java   |  6 +-
 .../apache/hudi/functional/TestWriteClient.java| 87 ++
 3 files changed, 96 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
index 71295098f03..12584be55a4 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
@@ -62,6 +62,7 @@ import 
org.apache.hudi.common.table.view.TableFileSystemView.SliceView;
 import org.apache.hudi.common.util.ClusteringUtils;
 import org.apache.hudi.common.util.Functions;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.config.HoodieWriteConfig;
@@ -825,7 +826,9 @@ public abstract class HoodieTable implements 
Serializable {
 boolean shouldValidate = config.shouldValidateAvroSchema();
 boolean allowProjection = config.shouldAllowAutoEvolutionColumnDrop();
 if ((!shouldValidate && allowProjection)
-|| 
getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) {
+|| 
getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()
+|| StringUtils.isNullOrEmpty(config.getSchema())
+) {
   // Check not required
   return;
 }
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java
index 454236b4278..569e8d36d89 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java
@@ -158,7 +158,7 @@ public class HoodieClientTestBase extends 
HoodieClientTestHarness {
*/
   public HoodieWriteConfig.Builder getConfigBuilder(String schemaStr, 
IndexType indexType,
 
HoodieFailedWritesCleaningPolicy cleaningPolicy) {
-return 
HoodieWriteConfig.newBuilder().withPath(basePath).withSchema(schemaStr)
+HoodieWriteConfig.Builder builder = 
HoodieWriteConfig.newBuilder().withPath(basePath)
 .withParallelism(2, 
2).withBulkInsertParallelism(2).withFinalizeWriteParallelism(2).withDeleteParallelism(2)
 .withTimelineLayoutVersion(TimelineLayoutVersion.CURR_VERSION)
 .withWriteStatusClass(MetadataMergeWriteStatus.class)
@@ -172,6 +172,10 @@ public class HoodieClientTestBase extends 
HoodieClientTestHarness {
 .withEnableBackupForRemoteFileSystemView(false) // Fail test if 
problem connecting to timeline-server
 .withRemoteServerPort(timelineServicePort)
 
.withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build());
+if (StringUtils.nonEmpty(schemaStr)) {
+  builder.withSchema(schemaStr);
+}
+return builder;
   }
 
   public HoodieSparkTable getHoodieTable(HoodieTableMetaClient metaClient, 
HoodieWriteConfig config) {
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java
 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java
new file mode 100644
index 000..7acf6b2b6b0
--- /dev/null
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * 

[GitHub] [hudi] danny0405 merged pull request #9335: [HUDI-6627] Fix NPE when spark client writer schema is null

2023-08-02 Thread via GitHub


danny0405 merged PR #9335:
URL: https://github.com/apache/hudi/pull/9335


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-02 Thread via GitHub


danny0405 commented on code in PR #9327:
URL: https://github.com/apache/hudi/pull/9327#discussion_r1282541082


##
hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieRecordDelegate.java:
##
@@ -70,4 +78,24 @@ public void testKryoSerializeDeserialize() {
 assertEquals(new HoodieRecordLocation("001", "file01"), 
hoodieRecordDelegate.getCurrentLocation().get());
 assertEquals(new HoodieRecordLocation("001", "file-01"), 
hoodieRecordDelegate.getNewLocation().get());
   }
+
+  public Kryo getKryoInstance() {
+final Kryo kryo = new Kryo();
+// This instance of Kryo should not require prior registration of classes
+kryo.setRegistrationRequired(false);
+kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
+// Handle cases where we may have an odd classloader setup like with 
libjars
+// for hadoop
+kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
+
+// Register Hudi's classes
+new HoodieCommonKryoRegistrar().registerClasses(kryo);
+
+// Register serializers
+kryo.register(Utf8.class, new SerializationUtils.AvroUtf8Serializer());
+kryo.register(GenericData.Fixed.class, new GenericAvroSerializer<>());

Review Comment:
   Do we need a registration for avro classes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


danny0405 commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1282536263


##
pom.xml:
##
@@ -98,8 +98,6 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
-
-

Review Comment:
   @amrishlal You are right: if the Hudi bundle jar shades the class anyway, we 
should always include the jar in the bundle, or any reference to the JSR class 
could hit a class-not-found exception.
   
   Another choice is to not shade the JSR clazz at all. Do we need a shade here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6615) Fix append mode and BulkInsertWriterHelper in flink

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6615.

Resolution: Fixed

Fixed via master branch: 9f2087a89443e93079d061fd81bf2f768f9c6953

> Fix append mode and BulkInsertWriterHelper in flink 
> 
>
> Key: HUDI-6615
> URL: https://issues.apache.org/jira/browse/HUDI-6615
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6615) Fix append mode and BulkInsertWriterHelper in flink

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6615:
-
Fix Version/s: 0.14.0

> Fix append mode and BulkInsertWriterHelper in flink 
> 
>
> Key: HUDI-6615
> URL: https://issues.apache.org/jira/browse/HUDI-6615
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper (#9314)

2023-08-02 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9f2087a8944 [HUDI-6615] Fix the condition of isInputSorted in 
BulkInsertWriterHelper (#9314)
9f2087a8944 is described below

commit 9f2087a89443e93079d061fd81bf2f768f9c6953
Author: Zouxxyy 
AuthorDate: Thu Aug 3 08:50:31 2023 +0800

[HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper 
(#9314)
---
 .../apache/hudi/configuration/OptionsResolver.java |  8 
 .../hudi/sink/bulk/BulkInsertWriterHelper.java |  3 ++-
 .../java/org/apache/hudi/sink/utils/Pipelines.java | 11 ++-
 .../apache/hudi/streamer/HoodieFlinkStreamer.java  |  2 +-
 .../org/apache/hudi/table/HoodieTableSink.java |  5 ++---
 .../apache/hudi/sink/ITTestDataStreamWrite.java|  2 +-
 .../hudi/sink/bucket/ITTestBucketStreamWrite.java  | 23 +-
 .../bucket/ITTestConsistentBucketStreamWrite.java  |  5 ++---
 8 files changed, 19 insertions(+), 40 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
index 8f4b013de04..944e795dc2f 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java
@@ -76,6 +76,14 @@ public class OptionsResolver {
 return operationType == WriteOperationType.INSERT;
   }
 
+  /**
+   * Returns whether the table operation is 'bulk_insert'.
+   */
+  public static boolean isBulkInsertOperation(Configuration conf) {
+WriteOperationType operationType = 
WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION));
+return operationType == WriteOperationType.BULK_INSERT;
+  }
+
   /**
* Returns whether it is a MERGE_ON_READ table.
*/
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java
index 56f668e32f0..3c0d4fb7662 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java
@@ -22,6 +22,7 @@ import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.OptionsResolver;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.io.storage.row.HoodieRowDataCreateHandle;
 import org.apache.hudi.table.HoodieTable;
@@ -84,7 +85,7 @@ public class BulkInsertWriterHelper {
 this.taskEpochId = taskEpochId;
 this.rowType = preserveHoodieMetadata ? rowType : 
addMetadataFields(rowType, writeConfig.allowOperationMetadataField()); // patch 
up with metadata fields
 this.preserveHoodieMetadata = preserveHoodieMetadata;
-this.isInputSorted = 
conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT);
+this.isInputSorted = OptionsResolver.isBulkInsertOperation(conf) && 
conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT);
 this.fileIdPrefix = UUID.randomUUID().toString();
 this.keyGen = preserveHoodieMetadata ? null : RowDataKeyGen.instance(conf, 
rowType);
   }
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
index 5d945d07aa1..fe51fe435e1 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
@@ -202,19 +202,12 @@ public class Pipelines {
* @param conf   The configuration
* @param rowTypeThe input row type
* @param dataStream The input data stream
-   * @param boundedWhether the input stream is bounded
* @return the appending data stream sink
*/
   public static DataStream append(
   Configuration conf,
   RowType rowType,
-  DataStream dataStream,
-  boolean bounded) {
-if (!bounded) {
-  // In principle, the config should be immutable, but the boundedness
-  // is only visible when creating the sink pipeline.
-  conf.setBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT, false);
-}
+  DataStream dataStream) {
 WriteOperatorFactory operatorFactory = 
AppendWriteOperator.getFactory(conf, rowType);
 
 return dataStream
@@ -469,7 +462,7 @@ public class Pipelines {
 }
 

[GitHub] [hudi] danny0405 merged pull request #9314: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper

2023-08-02 Thread via GitHub


danny0405 merged PR #9314:
URL: https://github.com/apache/hudi/pull/9314


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xuzifu666 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils

2023-08-02 Thread via GitHub


xuzifu666 commented on PR #9001:
URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663139322

   > > need close JavaTimeModule For JsonUtils
   > 
   > @xuzifu666 can you help me understand what the PR title means?
   
   When using a Spark version <= 3.2, a class-not-found error is reported for 
the JSR classes; at the time this was left as a TODO fix. So the check was 
moved to the Spark adapter, and Spark can run correctly. @xushiyan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9332: [HUDI-6625] Lazy create metadata and viewManager in HoodieTable

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9332:
URL: https://github.com/apache/hudi/pull/9332#issuecomment-1663137677

   
   ## CI report:
   
   * daa28a4bd88b29bf80b19210dfb4a54667e07cae UNKNOWN
   * 67e656c397338a14432a09013f007b0840c89db9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19007)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1663137528

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * 87e8f76e3d97d5b3b2fc10fe7704395575cc1b79 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19005)
 
   * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013)
 
   * 293ae466c121508e2e1d0b32c384c99ea1eea707 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table

2023-08-02 Thread via GitHub


danny0405 commented on PR #7580:
URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663133848

   > then these rollback instants will stay in the active timeline forever.
   
   That's true, we should optimize the archiving of cleaning and rollback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils

2023-08-02 Thread via GitHub


danny0405 commented on PR #9001:
URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663133061

   > in the wrong package somehow
   
   My guess is that some Spark versions have the dependency, and in the Hudi 
pom we shaded the clazz, but other Spark versions do not have this dependency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6632) Revert FileSystemBackedTableMetadata#getAllPartitionPaths improvements due to HUDI-6476

2023-08-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6632:
-
Labels: pull-request-available  (was: )

> Revert FileSystemBackedTableMetadata#getAllPartitionPaths improvements due to 
> HUDI-6476
> ---
>
> Key: HUDI-6632
> URL: https://issues.apache.org/jira/browse/HUDI-6632
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9343: [HUDI-6632] Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9343:
URL: https://github.com/apache/hudi/pull/9343#issuecomment-1663132727

   
   ## CI report:
   
   * 9d8464b88ac7656685cdd06f74efb6600b7d2250 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19006)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9314: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9314:
URL: https://github.com/apache/hudi/pull/9314#issuecomment-1663132594

   
   ## CI report:
   
   * 416c1dfc455a53bfe1d5367b7ab6d02aabd3a6dd UNKNOWN
   * cec56b320c0b83d49edbc453a9e50934c661d87d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19008)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6466) Spark's capacity of insert overwrite partitioned table with dynamic partition lost

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6466.

Resolution: Fixed

Fixed via master branch: d67455a4a713e295bba1d0a5d338fcfbe5af217e

> Spark's capacity of insert overwrite partitioned table with dynamic partition 
> lost
> -
>
> Key: HUDI-6466
> URL: https://issues.apache.org/jira/browse/HUDI-6466
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yonghua jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Mentioned as [#7365 
> (comment)|https://github.com/apache/hudi/pull/7365#issuecomment-1338371540] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6466) Spark's capacity of insert overwrite partitioned table with dynamic partition lost

2023-08-02 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6466:
-
Fix Version/s: 0.14.0

> Spark's capacity of insert overwrite partitioned table with dynamic partition 
> lost
> -
>
> Key: HUDI-6466
> URL: https://issues.apache.org/jira/browse/HUDI-6466
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yonghua jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Mentioned as [#7365 
> (comment)|https://github.com/apache/hudi/pull/7365#issuecomment-1338371540] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (8da99f8a5c9 -> d67455a4a71)

2023-08-02 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 8da99f8a5c9 [HUDI-6540] Support failed writes clean policy for Flink 
(#9211)
 add d67455a4a71 [HUDI-6466] Fix spark insert overwrite partitioned table 
with dynamic partition (#9113)

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/hudi/DataSourceOptions.scala  |   9 +
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |  84 ++--
 .../command/InsertIntoHoodieTableCommand.scala |  16 +-
 .../apache/spark/sql/hudi/TestInsertTable.scala| 228 +
 4 files changed, 225 insertions(+), 112 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-08-02 Thread via GitHub


danny0405 merged PR #9113:
URL: https://github.com/apache/hudi/pull/9113


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition

2023-08-02 Thread via GitHub


danny0405 commented on PR #9113:
URL: https://github.com/apache/hudi/pull/9113#issuecomment-1663129390

   The failed test is probably a flaky one: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18976=logs=4c665d41-fe93-5d6b-3716-d7e63fa41849=f7ca1aa0-5550-5ab6-0ee3-d8a5a59e7ac4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on issue #9282: [ISSUE] Hudi 0.13.0. Spark 3.3.2 Deltastreamed table read failure

2023-08-02 Thread via GitHub


amrishlal commented on issue #9282:
URL: https://github.com/apache/hudi/issues/9282#issuecomment-1663107557

   @rmnlchh @ad1happy2go I am looking at the following part of the stack trace:
   
   ```
   Cause: java.lang.IllegalArgumentException: For input string: "null"
   at scala.collection.immutable.StringLike.parseBoolean(StringLike.scala:330)
   at scala.collection.immutable.StringLike.toBoolean(StringLike.scala:289)
   at scala.collection.immutable.StringLike.toBoolean$(StringLike.scala:289)
   at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:33)
   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.(ParquetSchemaConverter.scala:70)
   at 
org.apache.spark.sql.execution.datasources.parquet.HoodieParquetFileFormatHelper$.buildImplicitSchemaChangeInfo(HoodieParquetFileFormatHelper.scala:30)
   ```
   The stack trace seems to indicate that there was a problem while trying to 
convert a string value into boolean (see code line at [spark v3.3.2 
ParquetSchemaConverter.scala:70](https://github.com/apache/spark/blob/5103e00c4ce5fcc4264ca9c4df12295d42557af6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L70)
 which I have pasted below):
   
   `conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean)`
   
   This line seems to indicate that you need to set 
`spark.sql.legacy.parquet.nanosAsLong` explicitly to 'true' or 'false' to avoid 
this exception (see the definition of 
[LEGACY_PARQUET_NANOS_AS_LONG](https://github.com/apache/spark/blob/5103e00c4ce5fcc4264ca9c4df12295d42557af6/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3462C87-L3462C87)).
 Please let me know if this doesn't fix the issue.
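   
   For example, in spark-shell (a sketch; `basePath` is a placeholder for your 
table path):
   
   ```scala
   // Set the legacy flag explicitly before reading, so the Parquet schema
   // converter does not see a literal "null" string for this conf.
   spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "false")
   val df = spark.read.format("hudi").load(basePath)
   ```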
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663103718

   
   ## CI report:
   
   * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015)
 
   * d2360a5a7de655991202680013d20268ce325666 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19016)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ys8 opened a new issue, #9348: [SUPPORT] hide soft-deleted rows

2023-08-02 Thread via GitHub


ys8 opened a new issue, #9348:
URL: https://github.com/apache/hudi/issues/9348

   If my reading is correct, `SELECT COUNT(*) FROM hudi_table` still includes 
soft-deleted rows. If that's true, is there a way to completely hide 
soft-deleted rows from SELECT queries?
   
   
[https://github.com/apache/hudi/blob/8da99f8a5c9ce3abd5a5a14baf3a8db81c3d39f0/hudi-[…]/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py](https://github.com/apache/hudi/blob/8da99f8a5c9ce3abd5a5a14baf3a8db81c3d39f0/hudi-examples/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py#L144-L185)
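   
   If "soft delete" here means the quickstart approach of nulling out all 
non-key fields, one workaround is to filter those rows out explicitly. A 
minimal sketch, assuming the quickstart trips schema where `rider` is a 
regular data column and `basePath` is your table path:
   
   ```scala
   // Hide soft-deleted rows (rows whose non-key fields were nulled out)
   // behind a temp view, then query the view instead of the raw table.
   spark.read.format("hudi").load(basePath)
     .where("rider is not null") // rider: any non-key data column
     .createOrReplaceTempView("hudi_table_visible")
   spark.sql("select count(*) from hudi_table_visible").show()
   ```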


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663099148

   
   ## CI report:
   
   * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015)
 
   * d2360a5a7de655991202680013d20268ce325666 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9320: [MINOR] Infer prepped boolean correctly and disable prepped write for MergeInto

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9320:
URL: https://github.com/apache/hudi/pull/9320#issuecomment-1663094309

   
   ## CI report:
   
   * 70e8bc9077123ca463bcc5912eb080ef37c36d3f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19004)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


amrishlal commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1282491522


##
packaging/hudi-integ-test-bundle/pom.xml:
##
@@ -319,12 +319,19 @@
 
   com.fasterxml.jackson.module
   jackson-module-scala_${scala.binary.version}
+  ${fasterxml.jackson.module.scala.version}
 
 
 
   com.fasterxml.jackson.dataformat
   jackson-dataformat-yaml
-  2.7.4
+  ${fasterxml.spark3.version}
+
+
+
+  com.fasterxml.jackson.datatype
+  jackson-datatype-jsr310
+  ${fasterxml.spark3.version}

Review Comment:
   Also using `${fasterxml.jackson.module.scala.version}` and 
`${fasterxml.jackson.dataformat.yaml.version}` to pull in the appropriate 
versions of `jackson-module-scala_${scala.binary.version}` and 
`jackson-dataformat-yaml`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line

2023-08-02 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6242:
-
Description: 
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g. images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients

Metafields :
- Should _recordkey get special UUID handling?


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info


  was:
This EPIC tracks changes to the Hudi storage format.

Format change is anything that changes any bits related to

 - *Timeline* : active or archived timeline contents, file names.
 - *Base Files*: file format versions, any changes to any data types, file 
footers, file names.
 - *Log Files*:  Block structure, content, names. 
 - *Metadata Table*: (should we call this index table instead?) partition 
names, number of file groups, key/value schema and metadata to MDT row 
mappings. 
- *Table properties*: What's written to hoodie.properties.
- *Marker files* : how would we treat these?


The following functionality should be supportable by the new format tech specs 
(at a minimum) 

Flexibility : 
 - Ability to mix different types of base files within a single table or even a 
single file group (e.g images, json, vectors ...) 
 - Easy integration of metadata for JVM and non-jvm clients


Additional Info:
 - Support encoding of watermarks/event time fields as first class citizen, for 
handling late arriving data. 
 - Position based skipping of base file
 - Additional metadata to avoid more RPCs to scan base file/log blocks.
 - ML/Column family use-case?
 - Support having changeset of columns in each write, other headers


Log : 
 - Support writing updates as deletes and inserts, instead of logging as update 
to base file.
 - CDC format is GA.

Table organization:
 - Support different logical partitions on the same data
 - Storage of table spread across buckets/root folders
 - Decouple table location from timeline, metadata. They can all be in 
different places


Concurrency/Timeline: 
 - Ability to support general purpose multi-table transactions, esp between 
data and metadata tables.
 - Support lockless/non-blocking transactions, where writers don't block each 
other even in face of conflicts. 
 - Support for long lived instants in timeline, break down distinction between 
active/archived
 - Support checking of uniqueness constraints, even in face of two concurrent 
insert transactions. 
 - Support precise time-travel queries
 - Support time-travel writes.
 - Support schema history tracking and aid in schema evol impl.
 - TrueTime store/support for instant times

Metadata table :
 - Encode filegroup ID and commit time along with file metadata


Table Properties: 
 - Partitioning information/indexing info



> Format changes for 

[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663065177

   
   ## CI report:
   
   * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9347:
URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663059384

   
   ## CI report:
   
   * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9324:
URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663053455

   
   ## CI report:
   
   * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] mansipp opened a new pull request, #9347: Upgrade aws java sdk to v2

2023-08-02 Thread via GitHub


mansipp opened a new pull request, #9347:
URL: https://github.com/apache/hudi/pull/9347

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


amrishlal commented on PR #9324:
URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663036555

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.

2023-08-02 Thread via GitHub


amrishlal commented on code in PR #9324:
URL: https://github.com/apache/hudi/pull/9324#discussion_r1282461432


##
packaging/hudi-integ-test-bundle/pom.xml:
##
@@ -319,12 +319,19 @@
     <dependency>
       <groupId>com.fasterxml.jackson.module</groupId>
       <artifactId>jackson-module-scala_${scala.binary.version}</artifactId>
+      <version>${fasterxml.jackson.module.scala.version}</version>
     </dependency>
 
     <dependency>
       <groupId>com.fasterxml.jackson.dataformat</groupId>
       <artifactId>jackson-dataformat-yaml</artifactId>
-      <version>2.7.4</version>
+      <version>${fasterxml.spark3.version}</version>
+    </dependency>
+
+    <dependency>
+      <groupId>com.fasterxml.jackson.datatype</groupId>
+      <artifactId>jackson-datatype-jsr310</artifactId>
+      <version>${fasterxml.spark3.version}</version>

Review Comment:
   Based on offline discussion, I modified this to 
`${fasterxml.version}` to pick up the right jackson package 
version for a given spark version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
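
For context: the jackson-datatype-jsr310 artifact pinned in the diff above supplies
JavaTimeModule, which an ObjectMapper must register before it can handle java.time
types; otherwise Jackson throws InvalidDefinitionException ("Java 8 date/time type
not supported by default"). A minimal, hypothetical illustration (the class and
values are made up; only the Jackson API calls are real):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;

import java.time.Instant;

public class Jsr310Demo {
  public static void main(String[] args) throws Exception {
    // Registering JavaTimeModule enables (de)serialization of java.time types.
    ObjectMapper mapper = new ObjectMapper().registerModule(new JavaTimeModule());

    // Serialized as a numeric epoch timestamp by default
    // (WRITE_DATES_AS_TIMESTAMPS is enabled out of the box).
    String json = mapper.writeValueAsString(Instant.parse("2023-08-02T00:00:00Z"));
    Instant back = mapper.readValue(json, Instant.class);
    System.out.println(json + " -> " + back);
  }
}
```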



[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9327:
URL: https://github.com/apache/hudi/pull/9327#issuecomment-1662993536

   
   ## CI report:
   
   * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN
   * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9335: [HUDI-6627] Fix NPE when spark client writer schema is null

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9335:
URL: https://github.com/apache/hudi/pull/9335#issuecomment-1662984338

   
   ## CI report:
   
   * b1091bdeaf25dcd95f567a8e50c2c6d4dc80fb79 UNKNOWN
   * 6386813364fd15848d9e63f4b77ce31c63e8a815 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19001)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9287:
URL: https://github.com/apache/hudi/pull/9287#issuecomment-1662984101

   
   ## CI report:
   
   * 6d171098737180ae6c8dcdf8cfb717e03359b300 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19002)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] dat-vikash commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2023-08-02 Thread via GitHub


dat-vikash commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1662943359

   Seeing this in flink 1.16.1 and hudi 0.13.1 with MoR tables and single 
writer (flink) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Krishen Bhan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishen Bhan updated HUDI-6596:
---
Description: 
h1. Issue

The existing rollback API in 0.14 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
 executes a rollback plan, either taking in an existing rollback plan provided 
by the caller for a previous rollback attempt, or scheduling a new rollback 
instant if none is provided. Currently it is not safe for two concurrent jobs 
to call this API (when skipLocking=false and the callers aren't already holding 
a lock), as this can lead to multiple requested rollback plans being created, or 
to two jobs executing the same rollback instant at the same time.
h1. Proposed change

One way to resolve this issue is to refactor this rollback function such that 
if skipLocking=false, the following steps are followed
 # Acquire the table lock
 # Reload the active timeline
 # Look at the active timeline to see if there is an inflight rollback instant 
from a previous rollback attempt; if it exists, then assign it as the rollback 
plan to execute. Also, check if a pending rollback plan was passed in by the 
caller. Then execute the following steps, depending on whether the caller passed 
a pending rollback instant plan.
 ##  [a] If a pending inflight rollback plan was passed in by the caller, then 
check that there is a previously attempted rollback instant on the timeline (and 
that the instant times match) and continue to use this rollback plan. If that 
isn't the case, then raise a rollback exception, since this means another job has 
already executed this plan concurrently. Note that in a valid HUDI dataset there 
can be at most one rollback instant for a corresponding commit instant, which is 
why, if we no longer see a pending rollback in the timeline in this phase, we can 
safely assume that it had already been executed to completion.
 ##  [b] If no pending inflight rollback plan was passed in by the caller and no 
pending rollback instant was found in the timeline earlier, then schedule a new 
rollback plan.
 # Now that a rollback plan and a requested rollback instant time have been 
assigned, check for an active heartbeat for the rollback instant time. If there 
is one, then abort the rollback, as that means there is a concurrent job 
executing that rollback. If not, then start a heartbeat for that rollback 
instant time.
 # Release the table lock
 # Execute the rollback plan and complete the rollback instant. Regardless of 
whether this succeeds or fails with an exception, close the heartbeat. This 
increases the chance that the next job that tries to call this rollback API 
will follow through with the rollback and not abort due to an active previous 
heartbeat

 
 * These steps will only be enforced for skipLocking=false, since 
skipLocking=true means the caller may already be explicitly holding a 
table lock. In this case, acquiring the lock again in step (1) will fail.
 * Acquiring a lock and reloading the timeline for (1-3) will guard against data 
race conditions where another job calls this rollback API at the same time and 
schedules its own rollback plan and instant. This is because, if no rollback has 
been attempted before for this instant, there is a window of time before step (1) 
in which another concurrent rollback job could have scheduled a rollback plan, 
failed execution, and cleaned up its heartbeat, all while the current rollback 
job is running. As a result, even if the current job was passed an empty pending 
rollback plan, it still needs to check the active timeline to ensure that no new 
pending rollback instant has been created. 
 * Using a heartbeat will signal to callers in other jobs that there is another 
job already executing this rollback. Checking for an expired heartbeat and 
(re)starting the heartbeat has to be done under a lock, so that multiple jobs 
don't each start it at the same time and assume that they are the only ones 
heartbeating. 
 * The table lock is no longer needed after (5), since it can now be safely 
assumed that no other job (calling this rollback API) will execute this 
rollback instant. 

One example implementation to achieve this:

 
{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime,
    Option<HoodiePendingRollbackInfo> pendingRollbackInfo, boolean skipLocking,
    Option<String> rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option<HoodieInstant> commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initialize table for rollback "
        + config.getBasePath() + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean 

[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Krishen Bhan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishen Bhan updated HUDI-6596:
---
Description: 
h1. Issue

The existing rollback API in 0.14 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
 executes a rollback plan, either taking in an existing rollback plan provided 
by the caller for a previous rollback attempt, or scheduling a new rollback 
instant if none is provided. Currently it is not safe for two concurrent jobs 
to call this API (when skipLocking=false and the callers aren't already holding 
a lock), as this can lead to multiple requested rollback plans being created, or 
to two jobs executing the same rollback instant at the same time.
h1. Proposed change

One way to resolve this issue is to refactor this rollback function such that 
if skipLocking=false, the following steps are followed
 # Acquire the table lock
 # Reload the active timeline
 # Look at the active timeline to see if there is an inflight rollback instant 
from a previous rollback attempt; if it exists, then assign it as the rollback 
plan to execute (at first). Also, check if a pending rollback plan was passed in 
by the caller. Then execute the following steps, depending on whether the caller 
passed a pending rollback instant plan.
 ##  [a] If a pending inflight rollback plan was passed in by the caller, then 
check that there is a previously attempted rollback instant on the timeline (and 
that the instant times match) and continue to use this rollback plan. If that 
isn't the case, then raise a rollback exception, since this means another job has 
already executed this plan concurrently. Note that in a valid HUDI dataset there 
can be at most one rollback instant for a corresponding commit instant, which is 
why, if we no longer see a pending rollback in the timeline in this phase, we can 
safely assume that it had already been executed to completion.
 ##  [b] If no pending inflight rollback plan was passed in by the caller and no 
pending rollback instant was found in the timeline earlier, then schedule a new 
rollback plan.
 # Now that a rollback plan and a requested rollback instant time have been 
assigned, check for an active heartbeat for the rollback instant time. If there 
is one, then abort the rollback, as that means there is a concurrent job 
executing that rollback. If not, then start a heartbeat for that rollback 
instant time.
 # Release the table lock
 # Execute the rollback plan and complete the rollback instant. Regardless of 
whether this succeeds or fails with an exception, close the heartbeat. This 
increases the chance that the next job that tries to call this rollback API 
will follow through with the rollback and not abort due to an active previous 
heartbeat

 
 * These steps will only be enforced for skipLocking=false, since 
skipLocking=true means the caller may already be explicitly holding a 
table lock. In this case, acquiring the lock again in step (1) will fail.
 * Acquiring a lock and reloading the timeline for (1-3) will guard against data 
race conditions where another job calls this rollback API at the same time and 
schedules its own rollback plan and instant. This is because, if no rollback has 
been attempted before for this instant, there is a window of time before step (1) 
in which another concurrent rollback job could have scheduled a rollback plan, 
failed execution, and cleaned up its heartbeat, all while the current rollback 
job is running. As a result, even if the current job was passed an empty pending 
rollback plan, it still needs to check the active timeline to ensure that no new 
pending rollback instant has been created. 
 * Using a heartbeat will signal to callers in other jobs that there is another 
job already executing this rollback. Checking for an expired heartbeat and 
(re)starting the heartbeat has to be done under a lock, so that multiple jobs 
don't each start it at the same time and assume that they are the only ones 
heartbeating. 
 * The table lock is no longer needed after (5), since it can now be safely 
assumed that no other job (calling this rollback API) will execute this 
rollback instant. 

One example implementation to achieve this:

 
{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime,
    Option<HoodiePendingRollbackInfo> pendingRollbackInfo, boolean skipLocking,
    Option<String> rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option<HoodieInstant> commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initialize table for rollback "
        + config.getBasePath() + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean 

[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Krishen Bhan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishen Bhan updated HUDI-6596:
---
Description: 
h1. Issue

The existing rollback API in 0.14 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
 executes a rollback plan, either taking in an existing rollback plan provided 
by the caller for a previous rollback attempt, or scheduling a new rollback 
instant if none is provided. Currently it is not safe for two concurrent jobs 
to call this API (when skipLocking=false and the callers aren't already holding 
a lock), as this can lead to multiple requested rollback plans being created, or 
to two jobs executing the same rollback instant at the same time.
h1. Proposed change

One way to resolve this issue is to refactor this rollback function such that 
if skipLocking=false, the following steps are followed
 # Acquire the table lock
 # Reload the active timeline
 # Look at the active timeline to see if there is an inflight rollback instant 
from a previous rollback attempt; if it exists, then assign it as the rollback 
plan to execute. Also, check if a pending rollback plan was passed in by the 
caller. Then execute the following steps, depending on whether the caller passed 
a pending rollback instant plan.
 ##  [a] If a pending inflight rollback plan was passed in by the caller, then 
check that there is a previously attempted rollback instant on the timeline (and 
that the instant times match) and continue to use this rollback plan. If that 
isn't the case, then raise a rollback exception, since this means another job has 
already executed this plan concurrently. Note that in a valid HUDI dataset there 
can be at most one rollback instant for a corresponding commit instant, which is 
why, if we no longer see a pending rollback in the timeline in this phase, we can 
safely assume that it had already been executed to completion.
 ##  [b] If no pending inflight rollback plan was passed in by the caller and no 
pending rollback instant was found in the timeline earlier, then schedule a new 
rollback plan.
 # Now that a rollback plan and a requested rollback instant time have been 
assigned, check for an active heartbeat for the rollback instant time. If there 
is one, then abort the rollback, as that means there is a concurrent job 
executing that rollback. If not, then start a heartbeat for that rollback 
instant time.
 # Release the table lock
 # Execute the rollback plan and complete the rollback instant. Regardless of 
whether this succeeds or fails with an exception, close the heartbeat. This 
increases the chance that the next job that tries to call this rollback API 
will follow through with the rollback and not abort due to an active previous 
heartbeat

 
 * These steps will only be enforced for skipLocking=false, since 
skipLocking=true means the caller may already be explicitly holding a 
table lock. In this case, acquiring the lock again in step (1) will fail.
 * Acquiring a lock and reloading the timeline for (1-3) will guard against data 
race conditions where another job calls this rollback API at the same time and 
schedules its own rollback plan and instant. This is because, if no rollback has 
been attempted before for this instant, there is a window of time before step (1) 
in which another concurrent rollback job could have scheduled a rollback plan, 
failed execution, and cleaned up its heartbeat, all while the current rollback 
job is running. As a result, even if the current job was passed an empty pending 
rollback plan, it still needs to check the active timeline to ensure that no new 
pending rollback instant has been created. 
 * Using a heartbeat will signal to callers in other jobs that there is another 
job already executing this rollback. Checking for an expired heartbeat and 
(re)starting the heartbeat has to be done under a lock, so that multiple jobs 
don't each start it at the same time and assume that they are the only ones 
heartbeating. 
 * The table lock is no longer needed after (5), since it can now be safely 
assumed that no other job (calling this rollback API) will execute this 
rollback instant. 

One example implementation to achieve this:

 
{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime,
    Option<HoodiePendingRollbackInfo> pendingRollbackInfo, boolean skipLocking,
    Option<String> rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option<HoodieInstant> commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initialize table for rollback "
        + config.getBasePath() + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean 

[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Krishen Bhan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishen Bhan updated HUDI-6596:
---
Description: 
h1. Issue

The existing rollback API in 0.14 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
 executes a rollback plan, either taking in an existing rollback plan provided 
by the caller for a previous rollback attempt, or scheduling a new rollback 
instant if none is provided. Currently it is not safe for two concurrent jobs 
to call this API (when skipLocking=false and the callers aren't already holding 
a lock), as this can lead to multiple requested rollback plans being created, or 
to two jobs executing the same rollback instant at the same time.
h1. Proposed change

One way to resolve this issue is to refactor this rollback function such that 
if skipLocking=false, the following steps are followed
 # Acquire the table lock
 # Reload the active timeline
 # Look at the active timeline to see if there is an inflight rollback instant 
from a previous rollback attempt; if it exists, then assign it as the rollback 
plan to execute. Also, check if a pending rollback plan was passed in by the 
caller. Then execute the following steps, depending on whether the caller passed 
a pending rollback instant plan.
 ##  [a] If a pending inflight rollback plan was passed in by the caller, then 
check that there is a previously attempted rollback instant on the timeline (and 
that the instant times match) and continue to use this rollback plan. If that 
isn't the case, then raise a rollback exception, since this means another job has 
already executed this plan concurrently. Note that in a valid HUDI dataset there 
can be at most one rollback instant for a corresponding commit instant, which is 
why, if we no longer see a pending rollback in the timeline in this phase, we can 
safely assume that it had already been executed to completion.
 ##  [b] If no pending inflight rollback plan was passed in by the caller, then 
schedule a new rollback plan if no pending rollback instant was found in the 
timeline earlier.
 # Now that a rollback plan and a requested rollback instant time have been 
assigned, check for an active heartbeat for the rollback instant time. If there 
is one, then abort the rollback, as that means there is a concurrent job 
executing that rollback. If not, then start a heartbeat for that rollback 
instant time.
 # Release the table lock
 # Execute the rollback plan and complete the rollback instant. Regardless of 
whether this succeeds or fails with an exception, close the heartbeat. This 
increases the chance that the next job that tries to call this rollback API 
will follow through with the rollback and not abort due to an active previous 
heartbeat

 
 * These steps will only be enforced for skipLocking=false, since 
skipLocking=true means the caller may already be explicitly holding a 
table lock. In this case, acquiring the lock again in step (1) will fail.
 * Acquiring a lock and reloading the timeline for (1-3) will guard against data 
race conditions where another job calls this rollback API at the same time and 
schedules its own rollback plan and instant. This is because, if no rollback has 
been attempted before for this instant, there is a window of time before step (1) 
in which another concurrent rollback job could have scheduled a rollback plan, 
failed execution, and cleaned up its heartbeat, all while the current rollback 
job is running. As a result, even if the current job was passed an empty pending 
rollback plan, it still needs to check the active timeline to ensure that no new 
pending rollback instant has been created. 
 * Using a heartbeat will signal to callers in other jobs that there is another 
job already executing this rollback. Checking for an expired heartbeat and 
(re)starting the heartbeat has to be done under a lock, so that multiple jobs 
don't each start it at the same time and assume that they are the only ones 
heartbeating. 
 * The table lock is no longer needed after (5), since it can now be safely 
assumed that no other job (calling this rollback API) will execute this 
rollback instant. 

One example implementation to achieve this:

 
{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime,
    Option<HoodiePendingRollbackInfo> pendingRollbackInfo, boolean skipLocking,
    Option<String> rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option<HoodieInstant> commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initialize table for rollback "
        + config.getBasePath() + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean 

[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs

2023-08-02 Thread Krishen Bhan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishen Bhan updated HUDI-6596:
---
Description: 
h1. Issue

The existing rollback API in 0.14 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877]
 executes a rollback plan, either taking in an existing rollback plan provided 
by the caller for a previous rollback attempt, or scheduling a new rollback 
instant if none is provided. Currently it is not safe for two concurrent jobs 
to call this API (when skipLocking=false and the callers aren't already holding 
a lock), as this can lead to multiple requested rollback plans being created, or 
to two jobs executing the same rollback instant at the same time.
h1. Proposed change

One way to resolve this issue is to refactor this rollback function such that 
if skipLocking=false, the following steps are followed
 # Acquire the table lock
 # Reload the active timeline
 # Look at the active timeline to see if there is an inflight rollback instant 
from a previous rollback attempt; if it exists, then assign it as the rollback 
plan to execute. Also, check if a pending rollback plan was passed in by the 
caller. Then execute the following steps, depending on whether the caller passed 
a pending rollback instant plan.
 ##  [a] If a pending inflight rollback plan was passed in by the caller, then 
check that there is a previously attempted rollback instant on the timeline (and 
that the instant times match) and continue to use this rollback plan. If that 
isn't the case, then raise a rollback exception, since this means another job has 
already executed this plan concurrently. Note that in a valid HUDI dataset there 
can be at most one rollback instant for a corresponding commit instant, which is 
why, if we no longer see a pending rollback in the timeline in this phase, we can 
safely assume that it had already been executed to completion.
 ##  [b] If no pending inflight rollback plan was passed in by the caller, then 
schedule a new rollback plan if no pending rollback instant was found in the 
timeline earlier.
 # Now that a rollback plan and a requested rollback instant time have been 
assigned, check for an active heartbeat for the rollback instant time. If there 
is one, then abort the rollback, as that means there is a concurrent job 
executing that rollback. If not, then start a heartbeat for that rollback 
instant time.
 # Release the table lock
 # Execute the rollback plan and complete the rollback instant. Whether this 
succeeds or fails with an exception, close the heartbeat. This increases the 
chance that the next job that tries to call this rollback API will follow 
through with the rollback and not abort due to an active previous heartbeat

 
 * These steps will only be enforced for skipLocking=false, since 
skipLocking=true means the caller may already be explicitly holding a 
table lock. In this case, acquiring the lock again in step (1) will fail.
 * Acquiring a lock and reloading the timeline for (1-3) will guard against data 
race conditions where another job calls this rollback API at the same time and 
schedules its own rollback plan and instant. This is because, if no rollback has 
been attempted before for this instant, there is a window of time before step (1) 
in which another concurrent rollback job could have scheduled a rollback plan, 
failed execution, and cleaned up its heartbeat, all while the current rollback 
job is running. As a result, even if the current job was passed an empty pending 
rollback plan, it still needs to check the active timeline to ensure that no new 
pending rollback instant has been created. 
 * Using a heartbeat will signal to callers in other jobs that there is another 
job already executing this rollback. Checking for an expired heartbeat and 
(re)starting the heartbeat has to be done under a lock, so that multiple jobs 
don't each start it at the same time and assume that they are the only ones 
heartbeating. 
 * The table lock is no longer needed after (5), since it can now be safely 
assumed that no other job (calling this rollback API) will execute this 
rollback instant. 

One example implementation to achieve this:

 
{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime,
    Option<HoodiePendingRollbackInfo> pendingRollbackInfo, boolean skipLocking,
    Option<String> rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option<HoodieInstant> commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initialize table for rollback "
        + config.getBasePath() + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean deleteInstantsDuringRollback;
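
For readers skimming the thread: the {code} block in each update above is truncated 
by the archive. Below is a minimal, self-contained Java sketch of the six proposed 
steps; it is not Hudi code. A ReentrantLock and an in-memory map stand in for Hudi's 
table lock and HoodieHeartbeatClient, and every name in it is illustrative.

{code:java}
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackCoordinationSketch {
  private final ReentrantLock tableLock = new ReentrantLock();            // stand-in for the table lock
  private final Map<String, Long> heartbeats = new ConcurrentHashMap<>(); // plan -> heartbeat expiry (ms)
  private static final long HEARTBEAT_TTL_MS = 60_000L;

  public boolean rollback(String commitInstantTime, Optional<String> pendingPlan) {
    final String plan;
    tableLock.lock();                                                 // (1) acquire the table lock
    try {
      Optional<String> onTimeline = findPendingRollback(commitInstantTime); // (2)-(3) reload and inspect timeline
      if (pendingPlan.isPresent()) {
        // (3a) the caller's plan must still be pending on the timeline
        if (!onTimeline.equals(pendingPlan)) {
          throw new IllegalStateException("Rollback already executed by a concurrent job");
        }
        plan = pendingPlan.get();
      } else {
        // (3b) reuse the pending plan if one exists, otherwise schedule a new one
        plan = onTimeline.orElseGet(() -> scheduleNewRollbackPlan(commitInstantTime));
      }
      Long expiry = heartbeats.get(plan);
      if (expiry != null && expiry > System.currentTimeMillis()) {
        return false;                                                 // (4) active heartbeat: abort, concurrent rollback
      }
      heartbeats.put(plan, System.currentTimeMillis() + HEARTBEAT_TTL_MS); // (4) start heartbeating under the lock
    } finally {
      tableLock.unlock();                                             // (5) release the table lock
    }
    try {
      executeRollbackPlan(plan);                                      // (6) execute outside the lock
      return true;
    } finally {
      heartbeats.remove(plan);                                        // close heartbeat on success or failure
    }
  }

  // Stubs standing in for timeline and plan operations.
  private Optional<String> findPendingRollback(String commitInstantTime) { return Optional.empty(); }
  private String scheduleNewRollbackPlan(String commitInstantTime) { return commitInstantTime + ".rollback"; }
  private void executeRollbackPlan(String plan) { /* rollback work happens here */ }
}
{code}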
 

[GitHub] [hudi] bhasudha commented on a diff in pull request #9338: [DOCS] Update bootstrap page

2023-08-02 Thread via GitHub


bhasudha commented on code in PR #9338:
URL: https://github.com/apache/hudi/pull/9338#discussion_r1282377412


##
website/docs/migration_guide.md:
##
@@ -69,12 +79,28 @@ for partition in [list of partitions in source table] {
 }
 ```  
 
-**Option 3**
+**Option 3 using Spark SQL CALL Procedure**
+
+Refer to [Bootstrap 
procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more 
details. 
+
+**Option 4 using Hudi CLI**
+
 Write your own custom logic of how to load an existing table into a Hudi 
managed one. Please read about the RDD API
 [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has 
been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 
 ```java
 hudi->bootstrap run --srcPath /tmp/source_table --targetPath 
/tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType 
COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField 
${PARTITION_FIELD} --sparkMaster local --hoodieConfigs 
hoodie.datasource.write.hive_style_partitioning=true --selectorClass 
org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
-Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, 
see detalis with help "bootstrap run".
+Unlike Hudi Streamer, FULL_RECORD or METADATA_ONLY is set with 
--selectorClass, see details with help "bootstrap run".
+
+
+## Configs
+
+Here are the basic configs that control bootstrapping.
+
+| Config Name  | Default| 
Description 
|
+|  | -- | 
---
 |
+| hoodie.bootstrap.base.path | N/A **(Required)** | Base path of the dataset 
that needs to be bootstrapped as a Hudi table`Config Param: 
BASE_PATH``Since Version: 0.6.0` |
+
+By default, with only `hoodie.bootstrap.base.path` being provided 
METADATA_ONLY mode is selected. For other options, please refer [bootstrap 
configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) 
for more details.

Review Comment:
   will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils

2023-08-02 Thread via GitHub


xushiyan commented on PR #9001:
URL: https://github.com/apache/hudi/pull/9001#issuecomment-1662902015

   > need close JavaTimeModule For JsonUtils
   
   @xuzifu666 can you help me understand what the PR title means?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9276:
URL: https://github.com/apache/hudi/pull/9276#issuecomment-1662877299

   
   ## CI report:
   
   * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN
   * 87e8f76e3d97d5b3b2fc10fe7704395575cc1b79 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19005)
 
   * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jonvex commented on a diff in pull request #9338: [DOCS] Update bootstrap page

2023-08-02 Thread via GitHub


jonvex commented on code in PR #9338:
URL: https://github.com/apache/hudi/pull/9338#discussion_r1282347510


##
website/docs/migration_guide.md:
##
@@ -56,11 +64,13 @@ spark-submit --master local \
 --hoodie-conf 
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
 --hoodie-conf 
hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
 \

Review Comment:
   I don't think we need `hoodie-conf hoodie.bootstrap.full.input.provider` in 
the example



##
website/docs/migration_guide.md:
##
@@ -69,12 +79,28 @@ for partition in [list of partitions in source table] {
 }
 ```  
 
-**Option 3**
+**Option 3 using Spark SQL CALL Procedure**
+
+Refer to [Bootstrap 
procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more 
details. 
+
+**Option 4 using Hudi CLI**
+
 Write your own custom logic of how to load an existing table into a Hudi 
managed one. Please read about the RDD API
 [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has 
been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 
 ```java
 hudi->bootstrap run --srcPath /tmp/source_table --targetPath 
/tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType 
COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField 
${PARTITION_FIELD} --sparkMaster local --hoodieConfigs 
hoodie.datasource.write.hive_style_partitioning=true --selectorClass 
org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
-Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, 
see detalis with help "bootstrap run".
+Unlike Hudi Streamer, FULL_RECORD or METADATA_ONLY is set with 
--selectorClass, see details with help "bootstrap run".
+
+
+## Configs
+
+Here are the basic configs that control bootstrapping.
+
+| Config Name  | Default| 
Description 
|
+|  | -- | 
---
 |
+| hoodie.bootstrap.base.path | N/A **(Required)** | Base path of the dataset 
that needs to be bootstrapped as a Hudi table`Config Param: 
BASE_PATH``Since Version: 0.6.0` |
+
+By default, with only `hoodie.bootstrap.base.path` being provided 
METADATA_ONLY mode is selected. For other options, please refer [bootstrap 
configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) 
for more details.

Review Comment:
   I think adding `hoodie.bootstrap.mode.selector.regex.mode`, 
`hoodie.bootstrap.mode.selector`, `hoodie.bootstrap.mode.selector.regex` to the 
simple configs would be helpful. At a minimum at least 
`hoodie.bootstrap.mode.selector` should be added



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
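
The configs discussed in this review thread can be exercised end to end through the 
Spark datasource. A hypothetical sketch (the table name, paths, fields, and regex are 
made up; the config keys and the regex selector class come from Hudi's bootstrap configs):

```java
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BootstrapConfigExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-bootstrap-example")
        .master("local[*]")
        .getOrCreate();

    // METADATA_ONLY is the default bootstrap mode; the regex selector below
    // upgrades the partitions matching dt=2023-.* to FULL_RECORD.
    spark.emptyDataFrame().write().format("hudi")
        .option("hoodie.datasource.write.operation", "bootstrap")
        .option("hoodie.table.name", "bootstrap_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.bootstrap.base.path", "/tmp/source_table")
        .option("hoodie.bootstrap.mode.selector",
            "org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector")
        .option("hoodie.bootstrap.mode.selector.regex", "dt=2023-.*")
        .option("hoodie.bootstrap.mode.selector.regex.mode", "FULL_RECORD")
        .mode(SaveMode.Overwrite)
        .save("/tmp/hoodie/bootstrap_table");

    spark.stop();
  }
}
```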



[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table

2023-08-02 Thread via GitHub


hudi-bot commented on PR #9261:
URL: https://github.com/apache/hudi/pull/9261#issuecomment-1662867526

   
   ## CI report:
   
   * c84b19381bc677b00b20e7f3ad3bc5c3b2287ef1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18985)
 
   * 5b6c8a9f7e241fb76bc7112881e0a9cbbeb07a12 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19012)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


