Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780459956

   
   ## CI report:
   
   * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
 
   * d5132f11b2cd4fff06c286ef9741dbaa80fa0463 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20500)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780453606

   
   ## CI report:
   
   * a9b3bd0f7aa15a9fbef3caaae798aa34790e027a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20499)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780453442

   
   ## CI report:
   
   * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
 
   * d5132f11b2cd4fff06c286ef9741dbaa80fa0463 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780446929

   
   ## CI report:
   
   * a9b3bd0f7aa15a9fbef3caaae798aa34790e027a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780446730

   
   ## CI report:
   
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780425889

   Thanks for the fix. At a high level, I think we should avoid relying on 
Spark mechanisms for any rollback/cleaning improvement here; it's hacky to 
maintain and not tenable for all engines.





Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on code in PR #9922:
URL: https://github.com/apache/hudi/pull/9922#discussion_r1372590546


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##
@@ -272,11 +272,12 @@ private static Path makeNewPath(FileSystem fs, String 
partitionPath, String file
*
* @param partitionPath Partition path
*/
-  private static void createMarkerFile(String partitionPath,
+  private void createMarkerFile(String partitionPath,
String dataFileName,
String instantTime,
HoodieTable table,
HoodieWriteConfig writeConfig) {
+stopIfAborted();

Review Comment:
   Is creating the marker file the right time to abort the task?






Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on PR #9886:
URL: https://github.com/apache/hudi/pull/9886#issuecomment-1780419088

   Nice catch @xicm,
   
   We may need to check whether `config.getColumnsEnabledForColumnStatsIndex()` 
contains the `hoodie.table.recordkey.fields` field:
   
   - if `config.getColumnsEnabledForColumnStatsIndex()` is empty, that means all 
the fields (including the metadata fields) are indexed in col_stats, so we can 
still use `hoodie.table.recordkey.fields` (caution: if 
`hoodie.table.recordkey.fields` is not configured, we can fall back to 
`_hoodie_record_key`);
   - if it is not empty, we need to check whether `hoodie.table.recordkey.fields` 
is included in col_stats; use it if it is included and throw an exception 
otherwise, as sketched below.
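   A minimal sketch of that fallback logic, assuming hypothetical helper and 
exception names (only `config.getColumnsEnabledForColumnStatsIndex()` and 
`_hoodie_record_key` come from the discussion above; the table-config accessor 
is illustrative):
   
   ```java
   // Hedged sketch: pick the key field for range pruning, falling back to
   // _hoodie_record_key when hoodie.table.recordkey.fields is not configured.
   List<String> colStatsColumns = config.getColumnsEnabledForColumnStatsIndex();
   String recordKeyField = tableConfig.getRecordKeyFields()  // hypothetical accessor
       .map(fields -> fields[0])
       .orElse(HoodieRecord.RECORD_KEY_METADATA_FIELD);      // "_hoodie_record_key"
   // An empty list means all fields (including metadata fields) are indexed.
   if (!colStatsColumns.isEmpty() && !colStatsColumns.contains(recordKeyField)) {
     throw new HoodieIndexException("Record key field " + recordKeyField
         + " is not covered by col_stats; cannot prune ranges via the metadata table.");
   }
   ```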
   
   It would be great if we could also add the test cases mentioned in 
https://github.com/apache/hudi/issues/9870 .





[jira] [Assigned] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side

2023-10-25 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An reassigned HUDI-6989:


Assignee: Hui An

> Stop handling more data if task is aborted & clean partial files if possible 
> in task side
> -
>
> Key: HUDI-6989
> URL: https://issues.apache.org/jira/browse/HUDI-6989
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
>
> Spark sets the interrupt status in TaskContext if the task is aborted, and 
> HUDI needs to respect that and stop immediately. Also, we can clean partial 
> files on the task side to ensure these files won't be left behind.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6989:
-
Labels: pull-request-available  (was: )

> Stop handling more data if task is aborted & clean partial files if possible 
> in task side
> -
>
> Key: HUDI-6989
> URL: https://issues.apache.org/jira/browse/HUDI-6989
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
>
> Spark sets the interrupt status in TaskContext if the task is aborted, and 
> HUDI needs to respect that and stop immediately. Also, we can clean partial 
> files on the task side to ensure these files won't be left behind.





[PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-10-25 Thread via GitHub


boneanxs opened a new pull request, #9922:
URL: https://github.com/apache/hudi/pull/9922

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   1. `HoodieWriteHandle` needs to stop immediately if the task has failed.
   2. `TaskContextSupplier` adds a status to identify whether the task has 
failed.
   3. Clean up files on the task side if the task fails.
   
   When the `Executor` tries to kill a task, Spark will mark the `TaskContext` 
as interrupted and interrupt the task thread.
   
   Spark internally wraps the input in an `InterruptibleIterator` to monitor 
the task status and kill the task if it is interrupted, but we can still see 
e.g. `HoodieMergeHandle` keep writing data even after the executor has tried 
to kill it (see below). The reason is that `init` in `HoodieMergeHandle` first 
iterates over the `InterruptibleIterator` to build `keyToNewRecords`; if the 
kill signal arrives after the `init` method, `HoodieMergeHandle` might still 
write new records. 
   
   ```java
   23/03/23 02:28:45 INFO HoodieMergeHandle: MaxMemoryPerPartitionMerge => 
1073741824
   23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: another attempt succeeded
   23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in 
stage 11.0 (TID 1471), reason: Stage finished
   23/03/23 02:28:47 INFO HoodieMergeHandle: Number of entries in 
MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of 
entries in BitCaskDiskMap => 0, Size of file spilled to disk => 0
   23/03/23 02:28:47 INFO HoodieMergeHandle: partitionPath:grass_region=test, 
fileId to be merged:d3ee8406-4011-44a4-8913-8be0349a6686-0
   ```
   
   Btw, although in `BaseHoodieQueueBasedExecutor.execute` Hudi could exit 
immediately when the task thread is interrupted, in 
`BaseHoodieQueueBasedExecutor.awaitTermination` Hudi clears the `interrupt` 
exception and waits an extra `60` seconds to let the executor proceed. So the 
task will wait at least 60s after it has been killed. We need to avoid that.
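   As an illustration of the per-record guard this PR is aiming for, here is a 
minimal, hedged sketch (the `stopIfAborted` name follows the diff under 
review; the wrapper class and exception choice are assumptions, not the final 
code):
   
   ```java
   import org.apache.hudi.exception.HoodieException;
   import org.apache.spark.TaskContext;
   
   final class TaskAbortGuard {
     // Consult Spark's TaskContext interrupt flag so a write handle can stop
     // as soon as the executor tries to kill the task, not only during init.
     static void stopIfAborted() {
       TaskContext taskContext = TaskContext.get(); // null outside a Spark task
       if (taskContext != null && taskContext.isInterrupted()) {
         throw new HoodieException("Task " + taskContext.taskAttemptId()
             + " was aborted; stopping further writes.");
       }
     }
   }
   ```
   
   Calling such a guard before each record write (rather than relying only on 
the input iterator) would let the handle observe a kill signal that arrives 
after `init` has already drained the `InterruptibleIterator`.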
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   None
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   None
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780406926

   
   ## CI report:
   
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780401311

   
   ## CI report:
   
   * 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
   * 8401cb3b04bea4ac0388d33ed40ad5853a5b7090 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20496)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Trino can't read tables created by Flink Hudi conector [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1780393004

   Were you able to debug the local fs test failures?





Re: [I] [bug] [fatal error] Severe bug seems to be a deadlock in the position of the BucketWrite Operator. [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on issue #9917:
URL: https://github.com/apache/hudi/issues/9917#issuecomment-1780391330

   Did you enable checkpointing?





Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on code in PR #9904:
URL: https://github.com/apache/hudi/pull/9904#discussion_r1372556576


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java:
##
@@ -503,6 +503,20 @@ public static String rowDataToString(List<RowData> rows) {
   public static void writeData(
   List<RowData> dataBuffer,
   Configuration conf) throws Exception {
+writeData(dataBuffer, conf, 1);
+  }
+
+  /**
+   * Write a list of row data with Hoodie format based on the given 
configuration.
+   *
+   * @param dataBuffer The data buffer to write
+   * @param conf   The flink configuration
+   * @param ckpId  The checkpoint id
+   * @throws Exception if error occurs
+   */
+  public static void writeData(
+  List<RowData> dataBuffer,

Review Comment:
   The checkpoint id does not affect the data write; there is no need to 
specify it explicitly.






Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780377139

   
   ## CI report:
   
   * d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
 
   * de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Closed] (HUDI-6821) Make multiple base file formats within each file group.

2023-10-25 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6821.
-
Resolution: Done

> Make multiple base file formats within each file group.
> ---
>
> Key: HUDI-6821
> URL: https://issues.apache.org/jira/browse/HUDI-6821
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Ability to mix different types of base files within a single table or even a 
> single file group (e.g images, json, vectors ...)





[hudi] branch master updated: [HUDI-6821] Support multiple base file formats in Hudi table (#9761)

2023-10-25 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8bf44c01b56 [HUDI-6821] Support multiple base file formats in Hudi 
table (#9761)
8bf44c01b56 is described below

commit 8bf44c01b56dd3afe5323dc7566971cee2e46d50
Author: Sagar Sumit 
AuthorDate: Thu Oct 26 09:27:02 2023 +0530

[HUDI-6821] Support multiple base file formats in Hudi table (#9761)
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  11 +-
 .../java/org/apache/hudi/io/HoodieWriteHandle.java |   3 +-
 .../java/org/apache/hudi/table/HoodieTable.java|  10 +-
 .../table/action/bootstrap/BootstrapUtils.java |   9 +-
 ...sistentHashingBucketClusteringPlanStrategy.java |   4 +-
 .../rollback/ListingBasedRollbackStrategy.java |   6 +-
 .../table/upgrade/ZeroToOneUpgradeHandler.java |   7 +-
 .../io/storage/row/HoodieRowDataCreateHandle.java  |   4 +-
 .../client/TestHoodieJavaWriteClientInsert.java|   4 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |   5 -
 .../TestHoodieJavaClientOnCopyOnWriteStorage.java  |   3 +-
 .../commit/TestJavaCopyOnWriteActionExecutor.java  |   4 +-
 .../testutils/HoodieJavaClientTestHarness.java |   4 +
 .../SparkBootstrapCommitActionExecutor.java|   2 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |  14 +-
 .../table/action/bootstrap/TestBootstrapUtils.java |  12 +-
 .../commit/TestCopyOnWriteActionExecutor.java  |   5 +-
 .../TestHoodieSparkMergeOnReadTableRollback.java   |   2 +-
 .../hudi/testutils/HoodieClientTestBase.java   |   5 +
 .../testutils/HoodieSparkClientTestHarness.java|   5 -
 .../apache/hudi/common/model/HoodieFileFormat.java |   9 +
 .../hudi/common/table/HoodieTableConfig.java   |  10 +
 .../hudi/common/table/HoodieTableMetaClient.java   |  19 +-
 .../org/apache/hudi/common/util/BaseFileUtils.java |   5 -
 .../org/apache/hudi/common/fs/TestFSUtils.java |  27 ++
 .../hudi/common/testutils/HoodieTestTable.java |   3 +-
 .../org/apache/hudi/BaseFileOnlyRelation.scala |   4 +-
 .../main/scala/org/apache/hudi/DefaultSource.scala |  52 ++--
 .../scala/org/apache/hudi/HoodieBaseRelation.scala | 107 +++-
 ...tils.scala => HoodieSparkFileFormatUtils.scala} |  35 +--
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |   9 +-
 .../hudi/MergeOnReadIncrementalRelation.scala  |   4 +-
 .../apache/hudi/MergeOnReadSnapshotRelation.scala  |  92 ---
 .../sql/catalyst/catalog/HoodieCatalogTable.scala  |   4 +-
 .../datasources/HoodieMultipleBaseFileFormat.scala | 278 +
 .../spark/sql/hudi/ProvidesHoodieConfig.scala  |   2 +-
 .../RepairMigratePartitionMetaProcedure.scala  |   2 +-
 .../org/apache/hudi/functional/TestBootstrap.java  |   8 +-
 .../apache/hudi/functional/TestOrcBootstrap.java   |   8 +-
 .../apache/hudi/testutils/DataSourceTestUtils.java |  20 +-
 .../TestHoodieMultipleBaseFileFormat.scala | 123 +
 .../datasources/Spark32NestedSchemaPruning.scala   |   3 +-
 .../hudi/utilities/streamer/HoodieStreamer.java|  10 +-
 43 files changed, 712 insertions(+), 241 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index cc3876338cc..5ae7ab25fbd 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -219,7 +219,7 @@ public class HoodieWriteConfig extends HoodieConfig {
   + "the timeline as an immutable log relying only on atomic writes 
for object storage.");
 
  public static final ConfigProperty<HoodieFileFormat> BASE_FILE_FORMAT = 
ConfigProperty
-  .key("hoodie.table.base.file.format")
+  .key("hoodie.base.file.format")
   .defaultValue(HoodieFileFormat.PARQUET)
   .withValidValues(HoodieFileFormat.PARQUET.name(), 
HoodieFileFormat.ORC.name(), HoodieFileFormat.HFILE.name())
   .withAlternatives("hoodie.table.ro.file.format")
@@ -1198,6 +1198,10 @@ public class HoodieWriteConfig extends HoodieConfig {
 return getString(BASE_PATH);
   }
 
+  public HoodieFileFormat getBaseFileFormat() {
+return HoodieFileFormat.valueOf(getStringOrDefault(BASE_FILE_FORMAT));
+  }
+
   public HoodieRecordMerger getRecordMerger() {
  List<String> mergers = 
StringUtils.split(getStringOrDefault(RECORD_MERGER_IMPLS), ",").stream()
 .map(String::trim)
@@ -2705,6 +2709,11 @@ public class HoodieWriteConfig extends HoodieConfig {
   return this;
 }
 
+public Builder withBaseFileFormat(String baseFileFormat) {
+  writeConfig.setValue(BASE_FILE_FORMAT, 
HoodieFileFormat.valueOf(baseFileFormat).name());
+  return this;
+}
+
 public Builder 

Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


codope merged PR #9761:
URL: https://github.com/apache/hudi/pull/9761





[jira] [Updated] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side

2023-10-25 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-6989:
-
Summary: Stop handling more data if task is aborted & clean partial files 
if possible in task side  (was: Stop ingesting more data if task is aborted & 
clean partial files if possible in task side)

> Stop handling more data if task is aborted & clean partial files if possible 
> in task side
> -
>
> Key: HUDI-6989
> URL: https://issues.apache.org/jira/browse/HUDI-6989
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Priority: Major
>
> Spark sets the interrupt status in TaskContext if the task is aborted, and 
> HUDI needs to respect that and stop immediately. Also, we can clean partial 
> files on the task side to ensure these files won't be left behind.





[jira] [Created] (HUDI-6989) Stop ingesting more data if task is aborted & clean partial files if possible in task side

2023-10-25 Thread Hui An (Jira)
Hui An created HUDI-6989:


 Summary: Stop ingesting more data if task is aborted & clean 
partial files if possible in task side
 Key: HUDI-6989
 URL: https://issues.apache.org/jira/browse/HUDI-6989
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Hui An


Spark sets the interrupt status in TaskContext if the task is aborted, and 
HUDI needs to respect that and stop immediately. Also, we can clean partial 
files on the task side to ensure these files won't be left behind.





Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780371965

   
   ## CI report:
   
   * 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
   * 8401cb3b04bea4ac0388d33ed40ad5853a5b7090 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


nsivabalan commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1372547231


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -173,7 +175,12 @@ private RecordIterator(Schema readerSchema, Schema 
writerSchema, byte[] content)
 this.totalRecords = this.dis.readInt();
   }
 
-  this.reader = new GenericDatumReader<>(writerSchema, readerSchema);
+  if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, 
readerSchema)) {

Review Comment:
   again, lets take an informed decision if we want to do it in this patch or a 
follow up.






Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]

2023-10-25 Thread via GitHub


harsh1231 commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1372547228


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java:
##
@@ -120,6 +120,11 @@ public static Dataset<Row> 
sanitizeColumnNamesForAvro(Dataset<Row> inputDataset,
 return targetDataset;
   }
 
+  public static Dataset<Row> sanitizeColumnNamesForAvro(Dataset<Row> 
inputDataset, TypedProperties props) {

Review Comment:
   This test has coverage for the above method: 
https://github.com/apache/hudi/blob/de2b5c95028029ff06d1f360763ca3f83c661ff3/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java#L131






Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on PR #9911:
URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780367765

   Retriggered the failed tests.





Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780366795

   
   ## CI report:
   
   * 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on code in PR #9883:
URL: https://github.com/apache/hudi/pull/9883#discussion_r1372535469


##
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##
@@ -160,9 +169,11 @@ public Map<String, Object> generateMetadataForRecord(
* @param schema The Avro schema of the record.
* @return A mapping containing the metadata.
*/
-  public Map<String, Object> generateMetadataForRecord(T record, Schema 
schema) {
+  public Map<String, Object> generateMetadataForRecord(T record, Schema 
schema, boolean isPartial) {
 Map<String, Object> meta = new HashMap<>();
 meta.put(INTERNAL_META_RECORD_KEY, getRecordKey(record, schema));
+meta.put(INTERNAL_META_SCHEMA, schema);
+meta.put(INTERNAL_META_IS_PARTIAL, isPartial);

Review Comment:
   I'm wondering whether we can represent the metadata as a POJO to make the 
interface more explicit and clear.
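   For illustration, a hedged sketch of such a POJO (the class name is 
hypothetical; the fields mirror the metadata keys used in this diff):
   
   ```java
   import org.apache.avro.Schema;
   
   // Hypothetical replacement for the Map<String, Object> metadata, turning
   // INTERNAL_META_RECORD_KEY / INTERNAL_META_SCHEMA / INTERNAL_META_IS_PARTIAL
   // into typed fields.
   public final class RecordMetadata {
     private final String recordKey;
     private final Schema schema;
     private final boolean isPartial;
   
     public RecordMetadata(String recordKey, Schema schema, boolean isPartial) {
       this.recordKey = recordKey;
       this.schema = schema;
       this.isPartial = isPartial;
     }
   
     public String getRecordKey() { return recordKey; }
     public Schema getSchema() { return schema; }
     public boolean isPartial() { return isPartial; }
   }
   ```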



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/BaseSparkInternalRowReaderContext.java:
##
@@ -94,16 +94,17 @@ public Comparable getOrderingValue(Option<InternalRow> 
rowOption,
 
   @Override
   public HoodieRecord<InternalRow> constructHoodieRecord(Option<InternalRow> 
rowOption,
- Map<String, Object> 
metadataMap,
- Schema schema) {
+ Map<String, Object> 
metadataMap) {
 if (!rowOption.isPresent()) {
   return new HoodieEmptyRecord<>(
   new HoodieKey((String) metadataMap.get(INTERNAL_META_RECORD_KEY),
   (String) metadataMap.get(INTERNAL_META_PARTITION_PATH)),
   HoodieRecord.HoodieRecordType.SPARK);
 }
 
+Schema schema = (Schema) metadataMap.get(INTERNAL_META_SCHEMA);
 InternalRow row = rowOption.get();
+boolean isPartial = (boolean) 
metadataMap.getOrDefault(INTERNAL_META_IS_PARTIAL, false);
 return new HoodieSparkRecord(row, 
HoodieInternalRowUtils.getCachedSchema(schema));

Review Comment:
   The `isPartial` is never used.



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##
@@ -51,6 +61,28 @@ class 
SparkFileFormatInternalRowReaderContext(baseFileReader: PartitionedFile =>
  requiredSchema: Schema,
  conf: Configuration): 
ClosableIterator[InternalRow] = {
 val fileInfo = 
sparkAdapter.getSparkPartitionedFileUtils.createPartitionedFile(partitionValues,
 filePath, start, length)
-new CloseableInternalRowIterator(baseFileReader.apply(fileInfo))
+if (filePath.toString.contains(HoodieLogFile.DELTA_EXTENSION)) {

Review Comment:
   Use `FSUtils.isLogFile` instead.



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##
@@ -151,7 +151,7 @@ public HoodieFileGroupReader(HoodieReaderContext<T> 
readerContext,
   public void initRecordIterators() {
 this.baseFileIterator = baseFilePath.isPresent()
 ? readerContext.getFileRecordIterator(
-baseFilePath.get().getHadoopPath(), start, length, 
readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf)
+baseFilePath.get().getHadoopPath(), start, length, 
readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf)

Review Comment:
   Unnecessary change?



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieKeyBasedFileGroupRecordBuffer.java:
##
@@ -127,10 +124,12 @@ public boolean hasNext() throws IOException {
 
   String recordKey = readerContext.getRecordKey(baseRecord, 
baseFileSchema);
  Pair<Option<T>, Map<String, Object>> logRecordInfo = 
records.remove(recordKey);
+  Map<String, Object> metadata = readerContext.generateMetadataForRecord(
+  baseRecord, baseFileSchema, false);

Review Comment:
   Caution: constructing the metadata per record may cause a performance 
regression.



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java:
##
@@ -70,9 +71,11 @@ public Option<Pair<HoodieRecord, Schema>> merge(HoodieRecord 
older, Schema oldSc
   }
 }
 if (older.getOrderingValue(oldSchema, 
props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) {
-  return Option.of(Pair.of(older, oldSchema));
+  return Option.of(SparkPartialMergingUtils.mergePartialRecords(
+  (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, 
oldSchema, props));

Review Comment:
   The partial merge may not happen, so maybe give the utility a better name.






Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


nsivabalan commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1372446787


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -116,9 +116,24 @@ public static String getAvroRecordQualifiedName(String 
tableName) {
 return "hoodie." + sanitizedTableName + "." + sanitizedTableName + 
"_record";
   }
 
+  /**
+   * Validate whether the {@code targetSchema} is a valid evolution of {@code 
sourceSchema}.
+   * Basically {@link #isCompatibleProjectionOf(Schema, Schema)} but type 
promotion in the
+   * opposite direction
+   */
+  public static boolean isValidEvolutionOf(Schema sourceSchema, Schema 
targetSchema) {
+return (sourceSchema.getType() == Schema.Type.NULL) || 
isProjectionOfInternal(sourceSchema, targetSchema,
+AvroSchemaUtils::isAtomicSchemasCompatibleEvolution);
+  }
+
+  private static boolean isAtomicSchemasCompatibleEvolution(Schema 
oneAtomicType, Schema anotherAtomicType) {

Review Comment:
   Can we write extensive docs on these methods? In general we have not been 
very comfortable touching this part of the code. Maybe meng tao and a few 
others are, but the rest of the PMCs have generally been very cautious. Can 
you add more docs around these methods so it's easier to maintain going 
forward.
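   As a seed for those docs, a hedged illustration of the intended semantics 
(the expected results follow the Javadoc above; they are not taken from a test 
in this PR):
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.hudi.avro.AvroSchemaUtils;
   
   public class EvolutionExample {
     public static void main(String[] args) {
       // isValidEvolutionOf(source, target): type promotion is allowed from
       // source to target (e.g. int -> long), but not the other way around.
       Schema intSchema = Schema.create(Schema.Type.INT);
       Schema longSchema = Schema.create(Schema.Type.LONG);
       System.out.println(AvroSchemaUtils.isValidEvolutionOf(intSchema, longSchema));  // expected: true
       System.out.println(AvroSchemaUtils.isValidEvolutionOf(longSchema, intSchema));  // expected: false
     }
   }
   ```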



##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -79,6 +81,14 @@ public class HoodieCommonConfig extends HoodieConfig {
   + " operation will fail schema compatibility check. Set this option 
to true will make the newly added "
   + " column nullable to successfully complete the write operation.");
 
+  public static final ConfigProperty ADD_NULL_FOR_DELETED_COLUMNS = 
ConfigProperty
+  .key("hoodie.datasource.add.null.for.deleted.columns")

Review Comment:
   "hoodie.datasource.set.null.for.missing.columns" 



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:
##
@@ -173,7 +175,12 @@ private RecordIterator(Schema readerSchema, Schema 
writerSchema, byte[] content)
 this.totalRecords = this.dis.readInt();
   }
 
-  this.reader = new GenericDatumReader<>(writerSchema, readerSchema);
+  if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, 
readerSchema)) {

Review Comment:
   We should try to unify our convention across the code base. 
   We use reader and writer schema here, table schema and source schema in 
outer layers, prevSchema in some cases, and sourceSchema and targetSchema in a 
few other places. 
   
   We should align all of these and use standard terminology throughout: 
maybe reader and writer in the write handle classes, and sourceSchema and 
targetSchema elsewhere. 



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -359,7 +360,8 @@ public Option> 
getLastCommitMetadataWi
 return Option.fromJavaOptional(
 getCommitMetadataStream()
 .filter(instantCommitMetadataPair ->
-
!StringUtils.isNullOrEmpty(instantCommitMetadataPair.getValue().getMetadata(HoodieCommitMetadata.SCHEMA_KEY)))
+
!StringUtils.isNullOrEmpty(instantCommitMetadataPair.getValue().getMetadata(HoodieCommitMetadata.SCHEMA_KEY))
+&& 
!WriteOperationType.schemaCantChange(instantCommitMetadataPair.getRight().getOperationType()))

Review Comment:
   Minor: can you switch the order of the conditions? 
   Let's first check the operation type, and then check for SCHEMA_KEY in the 
extra metadata.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.internal.schema.InternalSchema
+
+/**
+ * Util methods for Schema evolution in Hudi
+ */
+object HoodieSchemaUtils {
+  /**
+   * get latest internalSchema from table
+   *
+   * @param config  

Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372536310


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala:
##
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+import 
org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, 
REALTIME_SKIP_MERGE_OPT_VAL}
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, 
HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, 
PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.spark.broadcast.Broadcast
+import 
org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * File format that supports reading multiple base file formats in a table.
+ */
+class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState],
+   tableSchema: Broadcast[HoodieTableSchema],
+   tableName: String,
+   mergeType: String,
+   mandatoryFields: Seq[String],
+   isMOR: Boolean) extends FileFormat with 
SparkAdapterSupport {
+  private val parquetFormat = new ParquetFileFormat()
+  private val orcFormat = new OrcFileFormat()
+
+  override def inferSchema(sparkSession: SparkSession,
+   options: Map[String, String],
+   files: Seq[FileStatus]): Option[StructType] = {
+// This is a simple heuristic assuming all files have the same extension.
+val fileFormat = detectFileFormat(files.head.getPath.toString)
+
+fileFormat match {
+  case "parquet" => parquetFormat.inferSchema(sparkSession, options, files)
+  case "orc" => orcFormat.inferSchema(sparkSession, options, files)
+  case _ => throw new UnsupportedOperationException(s"File format 
$fileFormat is not supported.")
+}
+  }
+
+  override def isSplitable(sparkSession: SparkSession, options: Map[String, 
String], path: Path): Boolean = {
+false
+  }
+
+  // Used so that the planner only projects once and does not stack overflow
+  var isProjected = false
+
+  /**
+   * Support batch needs to remain consistent, even if one side of a bootstrap 
merge can support
+   * while the other side can't
+   */
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+if (!supportBatchCalled) {
+  supportBatchCalled = true
+  supportBatchResult =
+!isMOR && parquetFormat.supportBatch(sparkSession, schema) && 
orcFormat.supportBatch(sparkSession, schema)
+}
+supportBatchResult
+  }
+
+  override def prepareWrite(sparkSession: SparkSession,
+job: Job,
+options: Map[String, String],
+dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException("Write operations are not 
supported in this example.")
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
+  

Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372534307


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala:
##
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+import 
org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, 
REALTIME_SKIP_MERGE_OPT_VAL}
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, 
HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, 
PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.spark.broadcast.Broadcast
+import 
org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * File format that supports reading multiple base file formats in a table.
+ */
+class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState],
+   tableSchema: Broadcast[HoodieTableSchema],
+   tableName: String,
+   mergeType: String,
+   mandatoryFields: Seq[String],
+   isMOR: Boolean) extends FileFormat with 
SparkAdapterSupport {
+  private val parquetFormat = new ParquetFileFormat()
+  private val orcFormat = new OrcFileFormat()
+
+  override def inferSchema(sparkSession: SparkSession,
+   options: Map[String, String],
+   files: Seq[FileStatus]): Option[StructType] = {
+// This is a simple heuristic assuming all files have the same extension.
+val fileFormat = detectFileFormat(files.head.getPath.toString)
+
+fileFormat match {
+  case "parquet" => parquetFormat.inferSchema(sparkSession, options, files)
+  case "orc" => orcFormat.inferSchema(sparkSession, options, files)
+  case _ => throw new UnsupportedOperationException(s"File format 
$fileFormat is not supported.")
+}
+  }
+
+  override def isSplitable(sparkSession: SparkSession, options: Map[String, 
String], path: Path): Boolean = {
+false
+  }
+
+  // Used so that the planner only projects once and does not stack overflow
+  var isProjected = false
+
+  /**
+   * Support batch needs to remain consistent, even if one side of a bootstrap 
merge can support
+   * while the other side can't
+   */
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
+if (!supportBatchCalled) {
+  supportBatchCalled = true
+  supportBatchResult =
+!isMOR && parquetFormat.supportBatch(sparkSession, schema) && 
orcFormat.supportBatch(sparkSession, schema)
+}
+supportBatchResult
+  }
+
+  override def prepareWrite(sparkSession: SparkSession,
+job: Job,
+options: Map[String, String],
+dataSchema: StructType): OutputWriterFactory = {
+throw new UnsupportedOperationException("Write operations are not 
supported in this example.")
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
+  

Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372535019


##
hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java:
##
@@ -2763,10 +2762,6 @@ private void validateMetadata(HoodieJavaWriteClient 
testClient, Option i
   // Metadata table is MOR
   assertEquals(metadataMetaClient.getTableType(), 
HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR");
 
-  // Metadata table is HFile format
-  assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), 
HoodieFileFormat.HFILE,
-  "Metadata Table base file format should be HFile");
-
   // Metadata table has a fixed number of partitions

Review Comment:
   Going forward we'll have to remove this check, since the metadata table can 
also have multiple file formats once we support certain secondary indexes in 
formats other than HFile. This check did not add much value anyway.






Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]

2023-10-25 Thread via GitHub


zyclove commented on issue #9902:
URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780347759

   Also, when submitting a job with spark-submit, besides adding the 
configuration in code or specifying a configuration file, can the 
configuration be passed dynamically at submission time?
   @ad1happy2go 





Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]

2023-10-25 Thread via GitHub


zyclove commented on issue #9902:
URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780345033

   
[hoodie.avro.schema.external.transformation](https://hudi.apache.org/docs/configurations#hoodieavroschemaexternaltransformation)
   
   Check the Hudi code to see whether you can set this configuration to true.
   
   ```java
 public static final ConfigProperty<String> AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE = ConfigProperty
 .key(AVRO_SCHEMA_STRING.key() + ".external.transformation")
 .defaultValue("false")
 .withAlternatives(AVRO_SCHEMA_STRING.key() + ".externalTransformation")
 .markAdvanced()
 .withDocumentation("When enabled, records in older schema are 
rewritten into newer schema during upsert,delete and background"
 + " compaction,clustering operations.");
   ``` 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372503591


##
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java:
##
@@ -602,10 +601,6 @@ private void runFullValidation(HoodieMetadataConfig 
metadataConfig,
 // Metadata table is MOR
 assertEquals(metadataMetaClient.getTableType(), 
HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR");
 

Review Comment:
   Why remove the check?



##
hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java:
##
@@ -2763,10 +2762,6 @@ private void validateMetadata(HoodieJavaWriteClient 
testClient, Option i
   // Metadata table is MOR
   assertEquals(metadataMetaClient.getTableType(), 
HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR");
 
-  // Metadata table is HFile format
-  assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), 
HoodieFileFormat.HFILE,
-  "Metadata Table base file format should be HFile");
-
   // Metadata table has a fixed number of partitions

Review Comment:
   Why remove this check



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala:
##
@@ -456,7 +456,7 @@ trait ProvidesHoodieConfig extends Logging {
 hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key, 
enableHive.toString)
 hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_MODE.key, 
props.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE.key, 
HiveSyncMode.HMS.name()))
 hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_PATH, 
hoodieCatalogTable.tableLocation)
-hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, 
hoodieCatalogTable.baseFileFormat)
+hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, 
props.getString(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.key, 
HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.defaultValue))
 hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_DATABASE_NAME, 
hoodieCatalogTable.table.identifier.database.getOrElse("default"))

Review Comment:
   Do we have function regression if user does not provide the option 
`HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT`?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala:
##
@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+import 
org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, 
REALTIME_SKIP_MERGE_OPT_VAL}
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, 
HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, 
PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.spark.broadcast.Broadcast
+import 
org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * File format that supports reading multiple base file formats in a table.
+ */
+class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState],
+   tableSchema: Broadcast[HoodieTableSchema],
+ 

Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]

2023-10-25 Thread via GitHub


zyclove commented on issue #9902:
URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780338605

   @ad1happy2go 
   In another task, after upgrading to version 0.14, field incompatibility 
issues were reported.
   Can it be restored without rebuilding the data table?
   For example, through the Schema Evolution feature
   
   ```
   Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: 
org.apache.avro.AvroRuntimeException: cannot support rewrite value for schema 
type: "long" since the old schema type is: "string"
at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:387)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:369)
at 
org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:79)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335)
... 28 more
   Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.avro.AvroRuntimeException: cannot support rewrite value for schema 
type: "long" since the old schema type is: "string"
at 
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:75)
at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
... 32 more
   Caused by: org.apache.avro.AvroRuntimeException: cannot support rewrite 
value for schema type: "long" since the old schema type is: "string"
at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryTypeWithDiffSchemaType(HoodieAvroUtils.java:1083)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryType(HoodieAvroUtils.java:1001)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:946)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:944)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:902)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
at 
org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:843)
at 
org.apache.hudi.common.model.HoodieAvroIndexedRecord.rewriteRecordWithNewSchema(HoodieAvroIndexedRecord.java:123)
at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$composeSchemaEvolutionTransformer$2(HoodieMergeHelper.java:209)
at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$runMerge$0(HoodieMergeHelper.java:134)
at 
org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:68)
   ``` 
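   
   If the type change itself is legal, one possible path is Hudi's schema-on-read evolution rather than rebuilding the table. A hedged sketch; the exact ALTER support and the set of allowed type changes depend on the Hudi and Spark versions, and `my_table`/`my_col` are placeholders:
   
   ```scala
   // Enable comprehensive schema evolution for the session.
   spark.sql("set hoodie.schema.on.read.enable=true")
   // Change the column type back to what the table previously held.
   spark.sql("ALTER TABLE my_table ALTER COLUMN my_col TYPE string")
   ```
   
   Note that Avro only permits certain promotions (e.g. int to long, or numeric types to wider numeric types), which is why the rewrite from string to long fails in the stack trace above.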


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6949) Spark support non-blocking concurrency control

2023-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6949:
-
Labels: pull-request-available  (was: )

> Spark support non-blocking concurrency control
> --
>
> Key: HUDI-6949
> URL: https://issues.apache.org/jira/browse/HUDI-6949
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, spark-sql
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]

2023-10-25 Thread via GitHub


beyond1920 opened a new pull request, #9921:
URL: https://github.com/apache/hudi/pull/9921

   ### Change Logs
   
   This PR aims to support non-blocking concurrency control for Spark jobs.
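   
   A minimal sketch of what enabling it on the Spark write path might look like. The concurrency-mode value and the bucket-index/MOR prerequisites are assumptions carried over from the Flink non-blocking concurrency control work, not confirmed details of this PR:
   
   ```scala
   df.write.format("hudi")
     .option("hoodie.table.name", "t1")                                   // placeholder
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
     // Assumed knobs, mirroring the Flink NBCC setup:
     .option("hoodie.write.concurrency.mode", "NON_BLOCKING_CONCURRENCY_CONTROL")
     .option("hoodie.index.type", "BUCKET")
     .mode("append")
     .save("/tmp/hudi/t1")                                                // placeholder
   ```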
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   NA
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6988) Query failure for 0.14.0

2023-10-25 Thread Lin Liu (Jira)
Lin Liu created HUDI-6988:
-

 Summary: Query failure for 0.14.0
 Key: HUDI-6988
 URL: https://issues.apache.org/jira/browse/HUDI-6988
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lin Liu


{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
ShuffleMapStage 1054 (run at AccessController.java:0) has failed the maximum 
allowable number of times: 4. Most recent failure 
reason:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 259 partition 31  at 
org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1705)  
 at 
org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$10(MapOutputTracker.scala:1652)
   at 
org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$10$adapted(MapOutputTracker.scala:1651)
   at scala.collection.Iterator.foreach(Iterator.scala:943)at 
scala.collection.Iterator.foreach$(Iterator.scala:943)   at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1431)   at 
org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1651)
   at 
org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1294)
 at 
org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1256)
 at 
org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:140)
 at 
org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:63)   at 
org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:57)  at 
org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)
  at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:208) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) 
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)  at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)  
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)  
 at org.apache.spark.scheduler.Task.run(Task.scala:138)  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
 at java.lang.Thread.run(Thread.java:750)
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
   at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
   at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1995)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3048)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)  at 
org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
 at 
org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
   at 
org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
  at 

Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-25 Thread via GitHub


ksmou commented on PR #9911:
URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780305248

   > I see some test failures:
   > 
   > ```java
   > testUpsertsCOWContinuousMode{HoodieRecordType}[1]  Time elapsed: 396.414 s 
 <<< ERROR!
   > ```
   > 
   > Not sure whether it is related.
   
   It's not related; I ran the test locally and it passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Schema evolution copy [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9920:
URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780282165

   
   ## CI report:
   
   * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20493)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780282051

   
   ## CI report:
   
   * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on PR #9911:
URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780282030

   I see some test failures:
   
   ```java
   testUpsertsCOWContinuousMode{HoodieRecordType}[1]  Time elapsed: 396.414 s  
<<< ERROR!
   ```
   
   Not sure whether it is related.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]

2023-10-25 Thread via GitHub


danny0405 commented on code in PR #9911:
URL: https://github.com/apache/hudi/pull/9911#discussion_r1371363530


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java:
##
@@ -63,21 +60,9 @@ public Comparator<String> getComparator() {
 return comparator;
   }
 
-  @Override
-  public List<HoodieCompactionOperation> orderAndFilter(HoodieWriteConfig writeConfig,
-  List<HoodieCompactionOperation> operations, List<HoodieCompactionPlan> pendingCompactionPlans) {
-// Iterate through the operations and accept operations as long as we are 
within the configured target partitions
-// limit
-return operations.stream()
-
.collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream()
-
.sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction())
-.flatMap(e -> e.getValue().stream()).collect(Collectors.toList());
-  }
-
   @Override
   public List<String> filterPartitionPaths(HoodieWriteConfig writeConfig, List<String> allPartitionPaths) {
-return allPartitionPaths.stream().map(partition -> partition.replace("/", 
"-"))
-.sorted(Comparator.reverseOrder()).map(partitionPath -> 
partitionPath.replace("-", "/"))
+return allPartitionPaths.stream().sorted(comparator)
 .collect(Collectors.toList()).subList(0, 
Math.min(allPartitionPaths.size(),

Review Comment:
   Okay, got it. Note that you have changed the comparator of the partitions; does that introduce any potential regressions? Can we add some tests to cover it, along the lines of the sketch below?
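   
   A self-contained check of the concern (illustrative only; this is not the class's actual comparator, just the transform the old code applied):
   
   ```scala
   val partitions = Seq("2023/10/25", "2023/09/30", "2022/12/31")
   // Old behavior: replace '/' with '-', sort descending, map back.
   val oldOrder = partitions.map(_.replace("/", "-"))
     .sorted(Ordering[String].reverse)
     .map(_.replace("-", "/"))
   // New behavior, schematically: sort the raw paths descending.
   val newOrder = partitions.sorted(Ordering[String].reverse)
   // Holds for uniform yyyy/MM/dd paths because the separator sits at the same
   // position in every string; a unit test should pin this down for the real comparator.
   assert(oldOrder == newOrder)
   ```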



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6987) Support partition pruning with functional index

2023-10-25 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6987:
--
Fix Version/s: 1.0.0

> Support partition pruning with functional index
> ---
>
> Key: HUDI-6987
> URL: https://issues.apache.org/jira/browse/HUDI-6987
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> Current implementation can do data skipping if functional index exists. The 
> same can be leveraged for partition pruning if the function is on partition 
> field.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6987) Support partition pruning with functional index

2023-10-25 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-6987:
-

 Summary: Support partition pruning with functional index
 Key: HUDI-6987
 URL: https://issues.apache.org/jira/browse/HUDI-6987
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit


Current implementation can do data skipping if functional index exists. The 
same can be leveraged for partition pruning if the function is on partition 
field.
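
A sketch of the idea; the index-creation syntax below is taken from the functional index RFC and may change, and the table, column, and function names are placeholders:

{code:scala}
// Partition field `ts` holds an epoch; the functional index materializes a derived value.
spark.sql(
  """CREATE INDEX idx_datestr ON hudi_table
    |USING column_stats(ts)
    |OPTIONS(func='from_unixtime', format='yyyy-MM-dd')""".stripMargin)

// A filter over the same function of the partition field could then prune
// whole partitions, not just skip data files:
spark.sql("SELECT * FROM hudi_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-10-25'")
{code}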



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6986) Refactor new FileFormat implementations

2023-10-25 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-6986:
-

 Summary: Refactor new FileFormat implementations
 Key: HUDI-6986
 URL: https://issues.apache.org/jira/browse/HUDI-6986
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Sagar Sumit
 Fix For: 1.0.0


* Rename `NewHoodieParquetFileFormat`
 * Remove duplication between `NewHoodieParquetFileFormat`, 
`HoodieFileGroupReaderBasedFileFormat` and `HoodieMultipleBaseFileFormat`
 * `HoodieSparkFormatUtils` should be usable irrespective of whether one is 
using new file format or not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] Executor executes action [commits the instant 20230916074105355] error [hudi]

2023-10-25 Thread via GitHub


gtk96 commented on issue #9732:
URL: https://github.com/apache/hudi/issues/9732#issuecomment-1780260311

   > @gtk96 Were you able to confirm. Can we close this issue.
   
   Hi @ad1happy2go, our current version is 0.13 and has not been upgraded, so I can't verify this. If you have confirmed that the problem is solved, please close it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


yihua commented on code in PR #9883:
URL: https://github.com/apache/hudi/pull/9883#discussion_r1372442062


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -126,12 +128,13 @@ protected Option doProcessNextDataRecord(T record,
   // Merge and store the combined record
   // Note that the incoming `record` is from an older commit, so it should 
be put as
   // the `older` in the merge API
+
   HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge(
-  readerContext.constructHoodieRecord(Option.of(record), metadata, 
readerSchema),
-  readerSchema,
+  readerContext.constructHoodieRecord(Option.of(record), metadata),
+  (Schema) metadata.get(INTERNAL_META_SCHEMA),
   readerContext.constructHoodieRecord(
-  existingRecordMetadataPair.getLeft(), 
existingRecordMetadataPair.getRight(), readerSchema),
-  readerSchema,
+  existingRecordMetadataPair.getLeft(), 
existingRecordMetadataPair.getRight()),
+  (Schema) 
existingRecordMetadataPair.getRight().get(INTERNAL_META_SCHEMA),
   payloadProps).get().getLeft();

Review Comment:
   To clarify, for reading log files, the reader schema is fetched from the block header. Here we're doing record-level merging. Depending on the log file from which the records come, the schema could be different; however, the schema reference is the same, since the schema instance is passed through from the log reader.
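   
   Schematically, with names taken from the diff above (illustrative only):
   
   ```scala
   // Each record travels with its own metadata map; the writer schema of the log
   // block it came from is stored under INTERNAL_META_SCHEMA (a constant from this
   // PR). Merging therefore resolves the schema per record instead of assuming one
   // fixed reader schema.
   def schemaOf(metadata: java.util.Map[String, AnyRef]): org.apache.avro.Schema =
     metadata.get(INTERNAL_META_SCHEMA).asInstanceOf[org.apache.avro.Schema]
   ```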



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780214298

   
   ## CI report:
   
   * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491)
 
   * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1780214192

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 661b16906d31d259be3fac4707478bd71eb6f9a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20494)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6985) Cannot find complete timestamp

2023-10-25 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu reassigned HUDI-6985:
-

Assignee: Lin Liu

> Cannot find complete timestamp
> --
>
> Key: HUDI-6985
> URL: https://issues.apache.org/jira/browse/HUDI-6985
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Completion time should not be 
> empty        at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
>         at 
> org.apache.hudi.common.table.timeline.HoodieInstant.getCompleteFileName(HoodieInstant.java:263)
>         at 
> org.apache.hudi.common.table.timeline.HoodieInstant.getFileName(HoodieInstant.java:297)
>         at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantFileName(HoodieActiveTimeline.java:344)
>         at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:351)
>         at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getRollbackedCommits(HoodieTableMetadataUtil.java:1372)
>         at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$getValidInstantTimestamps$38(HoodieTableMetadataUtil.java:1300)
>         at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
>         at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>         at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
>         at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>         at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>         at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>         at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>         at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>         at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>         at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getValidInstantTimestamps(HoodieTableMetadataUtil.java:1299)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:476)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:432)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:417)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lookupKeysFromFileSlice(HoodieBackedTableMetadata.java:294)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:258)
>         at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:148)
>         at 
> org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:316)
>         at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:125)
>         ... 61 more
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?] 
>        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]     
>    at 
> com.microsoft.lst_bench.common.LSTBenchmarkExecutor.checkResults(LSTBenchmarkExecutor.java:165)
>  [lst-bench-0.1-SNAPSHOT.jar:?]        at 
> com.microsoft.lst_bench.common.LSTBenchmarkExecutor.execute(LSTBenchmarkExecutor.java:121)
>  [lst-bench-0.1-SNAPSHOT.jar:?]        at 
> com.microsoft.lst_bench.Driver.main(Driver.java:147) 
> [lst-bench-0.1-SNAPSHOT.jar:?]Caused by: java.sql.SQLException: 
> org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.hudi.exception.HoodieException: Error fetching partition paths 
> from metadata table        at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
> at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
>         at 
> 

[jira] [Created] (HUDI-6985) Cannot find complete timestamp

2023-10-25 Thread Lin Liu (Jira)
Lin Liu created HUDI-6985:
-

 Summary: Cannot find complete timestamp
 Key: HUDI-6985
 URL: https://issues.apache.org/jira/browse/HUDI-6985
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lin Liu


{code:java}
Caused by: java.lang.IllegalArgumentException: Completion time should not be 
empty        at 
org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42)
        at 
org.apache.hudi.common.table.timeline.HoodieInstant.getCompleteFileName(HoodieInstant.java:263)
        at 
org.apache.hudi.common.table.timeline.HoodieInstant.getFileName(HoodieInstant.java:297)
        at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantFileName(HoodieActiveTimeline.java:344)
        at 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:351)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.getRollbackedCommits(HoodieTableMetadataUtil.java:1372)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$getValidInstantTimestamps$38(HoodieTableMetadataUtil.java:1300)
        at 
java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at 
java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
        at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
        at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) 
       at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at 
java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at 
java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
       at 
java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
        at 
org.apache.hudi.metadata.HoodieTableMetadataUtil.getValidInstantTimestamps(HoodieTableMetadataUtil.java:1299)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:476)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:432)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:417)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.lookupKeysFromFileSlice(HoodieBackedTableMetadata.java:294)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:258)
        at 
org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:148)
        at 
org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:316)
        at 
org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:125)
        ... 61 more
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]   
     at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]        
at 
com.microsoft.lst_bench.common.LSTBenchmarkExecutor.checkResults(LSTBenchmarkExecutor.java:165)
 [lst-bench-0.1-SNAPSHOT.jar:?]        at 
com.microsoft.lst_bench.common.LSTBenchmarkExecutor.execute(LSTBenchmarkExecutor.java:121)
 [lst-bench-0.1-SNAPSHOT.jar:?]        at 
com.microsoft.lst_bench.Driver.main(Driver.java:147) 
[lst-bench-0.1-SNAPSHOT.jar:?]Caused by: java.sql.SQLException: 
org.apache.hive.service.cli.HiveSQLException: Error running query: 
org.apache.hudi.exception.HoodieException: Error fetching partition paths from 
metadata table        at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44)
        at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
        at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
        at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)        
at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
        at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
        at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
        at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
        at 

Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780208856

   
   ## CI report:
   
   * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491)
 
   * 6972591365be4bde76c7b41dc5122c63ffd18c79 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1780208689

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485)
 
   * 661b16906d31d259be3fac4707478bd71eb6f9a4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-25 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779694#comment-17779694
 ] 

Ethan Guo edited comment on HUDI-6793 at 10/25/23 11:11 PM:


This works after adding MOR snapshot query support with the new Hoodie parquet 
file format using new file group reader: HUDI-6786.


was (Author: guoyihua):
This works after adding MOR snapshot query support with the new Hoodie parquet 
file format using new file group reader.

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-25 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-6793.
-

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-25 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6793.
---
Resolution: Fixed

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader

2023-10-25 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779694#comment-17779694
 ] 

Ethan Guo commented on HUDI-6793:
-

This works after adding MOR snapshot query support with the new Hoodie parquet 
file format using new file group reader.

> Support time-travel read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6793
> URL: https://issues.apache.org/jira/browse/HUDI-6793
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader

2023-10-25 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6973.
---
Resolution: Fixed

> Instantiate HoodieFileGroupRecordBuffer inside new file group reader
> 
>
> Key: HUDI-6973
> URL: https://issues.apache.org/jira/browse/HUDI-6973
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6800) Implement log writing with partial updates on the write path

2023-10-25 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6800.
---
Resolution: Fixed

> Implement log writing with partial updates on the write path
> 
>
> Key: HUDI-6800
> URL: https://issues.apache.org/jira/browse/HUDI-6800
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Schema evolution copy [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9920:
URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780169741

   
   ## CI report:
   
   * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20493)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Schema evolution copy [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9920:
URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780162703

   
   ## CI report:
   
   * f98cbcb16737a88891703baeee15f5a6bd73e784 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6800] Support writing partial updates to the data blocks in MOR tables (#9876)

2023-10-25 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0ad4560f2a4 [HUDI-6800] Support writing partial updates to the data 
blocks in MOR tables (#9876)
0ad4560f2a4 is described below

commit 0ad4560f2a4de00e43814b0d6cef2886a8a38155
Author: Y Ethan Guo 
AuthorDate: Wed Oct 25 15:24:26 2023 -0700

[HUDI-6800] Support writing partial updates to the data blocks in MOR 
tables (#9876)

This commit adds the functionality to write partial updates to the data 
blocks in MOR tables, for Spark SQL MERGE INTO.
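
A hedged example of the Spark SQL surface this enables; table and column names are placeholders, and the partial-update behavior additionally depends on the write configuration introduced by this commit:

```scala
// MERGE INTO that updates only a subset of columns; with partial updates,
// only the changed columns need to be written to the log data block.
spark.sql(
  """MERGE INTO hudi_tbl t
    |USING updates s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET t.price = s.price""".stripMargin)
```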
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  18 ++-
 .../org/apache/hudi/io/HoodieAppendHandle.java |  18 ++-
 .../java/org/apache/hudi/io/HoodieWriteHandle.java |   2 +-
 .../common/table/log/block/HoodieLogBlock.java |   2 +-
 .../org/apache/hudi/common/util/ConfigUtils.java   |  20 +--
 .../scala/org/apache/hudi/DataSourceOptions.scala  |   9 ++
 .../hudi/command/MergeIntoHoodieTableCommand.scala | 147 +++--
 .../hudi/command/payload/ExpressionPayload.scala   |  20 ++-
 .../apache/spark/sql/hudi/TestMergeIntoTable.scala |  12 +-
 .../spark/sql/hudi/TestMergeIntoTable2.scala   |   6 +
 .../sql/hudi/TestPartialUpdateForMergeInto.scala   |  83 ++--
 11 files changed, 268 insertions(+), 69 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 8c08beaaef9..cc3876338cc 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -33,6 +33,7 @@ import org.apache.hudi.common.config.HoodieMetaserverConfig;
 import org.apache.hudi.common.config.HoodieReaderConfig;
 import org.apache.hudi.common.config.HoodieStorageConfig;
 import org.apache.hudi.common.config.HoodieTableServiceManagerConfig;
+import org.apache.hudi.common.config.HoodieTimeGeneratorConfig;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.engine.EngineType;
 import org.apache.hudi.common.fs.ConsistencyGuardConfig;
@@ -50,7 +51,6 @@ import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.log.block.HoodieLogBlock;
 import org.apache.hudi.common.table.marker.MarkerType;
-import org.apache.hudi.common.config.HoodieTimeGeneratorConfig;
 import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
 import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.common.util.ConfigUtils;
@@ -756,6 +756,14 @@ public class HoodieWriteConfig extends HoodieConfig {
   .withDocumentation("Whether to write record positions to the block 
header for data blocks containing updates and delete blocks. "
   + "The record positions can be used to improve the performance of 
merging records from base and log files.");
 
+  public static final ConfigProperty<String> WRITE_PARTIAL_UPDATE_SCHEMA = ConfigProperty
+  .key("hoodie.write.partial.update.schema")
+  .defaultValue("")
+  .markAdvanced()
+  .sinceVersion("1.0.0")
+  .withDocumentation("Avro schema of the partial updates. This is 
automatically set by the "
+  + "Hudi write client and user is not expected to manually change the 
value.");
+
   /**
* Config key with boolean value that indicates whether record being written 
during MERGE INTO Spark SQL
* operation are already prepped.
@@ -2072,6 +2080,14 @@ public class HoodieWriteConfig extends HoodieConfig {
 return getBoolean(WRITE_RECORD_POSITIONS);
   }
 
+  public boolean shouldWritePartialUpdates() {
+return !StringUtils.isNullOrEmpty(getString(WRITE_PARTIAL_UPDATE_SCHEMA));
+  }
+
+  public String getPartialUpdateSchema() {
+return getString(WRITE_PARTIAL_UPDATE_SCHEMA);
+  }
+
   public double getParquetCompressionRatio() {
 return getDouble(HoodieStorageConfig.PARQUET_COMPRESSION_RATIO_FRACTION);
   }
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
index 4075541a750..cc1932ce27f 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
@@ -149,7 +149,14 @@ public class HoodieAppendHandle<T, I, K, O> extends HoodieWriteHandle<T, I, K, O> {
   public HoodieAppendHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
 String partitionPath, String fileId, Iterator<HoodieRecord<T>> recordItr, TaskContextSupplier taskContextSupplier) {
-super(config, instantTime, 

Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


yihua merged PR #9876:
URL: https://github.com/apache/hudi/pull/9876


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] org.apache.hudi.exception.HoodieRollbackException: Failed to rollback [hudi]

2023-10-25 Thread via GitHub


Armelabdelkbir commented on issue #9213:
URL: https://github.com/apache/hudi/issues/9213#issuecomment-1780128455

   @ad1happy2go it has been working fine for me over the last few months; sorry for the late response.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] Schema evolution copy [hudi]

2023-10-25 Thread via GitHub


jonvex opened a new pull request, #9920:
URL: https://github.com/apache/hudi/pull/9920

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6984) query64 failed.

2023-10-25 Thread Lin Liu (Jira)
Lin Liu created HUDI-6984:
-

 Summary: query64 failed.
 Key: HUDI-6984
 URL: https://issues.apache.org/jira/browse/HUDI-6984
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Lin Liu
Assignee: Lin Liu


{code:java}
[hadoop@ip-10-0-112-196 lst-bench]$
2023-10-25T21:52:19,829 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing statement: query64.sql_0
2023-10-25T21:52:19,829 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing file: query64.sql
2023-10-25T21:52:19,830 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing task: single_user_0
2023-10-25T21:52:19,834 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing session: 0
2023-10-25T21:52:19,834  WARN [main] common.LSTBenchmarkExecutor: Thread did not finish correctly
java.util.concurrent.ExecutionException: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 851 in stage 3093.0 failed 4 times, most recent failure: Lost task 851.3 in stage 3093.0 (TID 666996) (ip-10-0-103-0.us-west-2.compute.internal executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor Process Lost
Driver stacktrace:
  at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
   at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)   at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
   at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
  at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
   at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
  at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225)
  at java.base/java.security.AccessController.doPrivileged(Native Method) at 
java.base/javax.security.auth.Subject.doAs(Subject.java:423) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239)
  at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)Caused by: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 851 in 
stage 3093.0 failed 4 times, most recent failure: Lost task 851.3 in stage 
3093.0 (TID 666996) (ip-10-0-103-0.us-west-2.compute.internal executor 7): 
ExecutorLostFailure (executor 7 exited caused by one of the running tasks) 
Reason: Executor Process LostDriver stacktrace:at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
 at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
   at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
   at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239)
  at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239)
  at scala.Option.foreach(Option.scala:407)   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993)
   at 

Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780111072

   
   ## CI report:
   
   * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780101745

   
   ## CI report:
   
   * 57481f626caf8864def8394c57316535fa490b90 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20490)
 
   * 085a8583eb56ff4b8d3afa3636c657b11d0db92f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1780092114

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


yihua commented on code in PR #9894:
URL: https://github.com/apache/hudi/pull/9894#discussion_r1372349415


##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -119,21 +124,36 @@ protected Option<T> doProcessNextDataRecord(T record,
          Map<String, Object> metadata,
          Pair<Option<T>, Map<String, Object>> existingRecordMetadataPair) throws IOException {
     if (existingRecordMetadataPair != null) {
-      // Merge and store the combined record
-      // Note that the incoming `record` is from an older commit, so it should be put as
-      // the `older` in the merge API
-      HoodieRecord<T> combinedRecord = (HoodieRecord<T>) recordMerger.merge(
-          readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema),
-          readerSchema,
-          readerContext.constructHoodieRecord(
-              existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema),
-          readerSchema,
-          payloadProps).get().getLeft();
-      // If pre-combine returns existing record, no need to update it
-      if (combinedRecord.getData() != existingRecordMetadataPair.getLeft().get()) {
-        return Option.of(combinedRecord.getData());
+      switch (recordMergeMode) {
+        case OVERWRITE_WITH_LATEST:
+          return Option.empty();
+        case EVENT_TIME_ORDERING:
+          Comparable incomingOrderingValue = readerContext.getOrderingValue(
+              Option.of(record), metadata, readerSchema, payloadProps);
+          Comparable existingOrderingValue = readerContext.getOrderingValue(
+              existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema, payloadProps);
+          if (incomingOrderingValue.compareTo(existingOrderingValue) > 0) {
+            return Option.of(record);
+          }
+          return Option.empty();

Review Comment:
   Yes, `existingRecordMetadataPair` should be in the log record mapping.  The convention here is that, if `Option.empty()` is returned from this method, the log record of the same record key in the mapping is not updated, which also avoids the `readerContext.seal` call:
   ```
   @Override
   public void processNextDataRecord(T record, Map<String, Object> metadata, Object recordKey) throws IOException {
     Pair<Option<T>, Map<String, Object>> existingRecordMetadataPair = records.get(recordKey);
     Option<T> mergedRecord = doProcessNextDataRecord(record, metadata, existingRecordMetadataPair);
     if (mergedRecord.isPresent()) {
       records.put(recordKey, Pair.of(Option.ofNullable(readerContext.seal(mergedRecord.get())), metadata));
     }
   }
   ```
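
   For illustration, a minimal, self-contained sketch of that convention under event-time ordering, with a hypothetical `LogRecord` type and `merge` helper rather than the actual Hudi classes: the record with the higher ordering value wins, and an empty result means the existing entry in the mapping is left untouched.
   ```
   import java.util.Optional;

   // Sketch only (illustrative names): higher ordering value (event time) wins;
   // an empty result means "keep the existing record in the mapping unchanged".
   final class EventTimeOrderingSketch {

     // Hypothetical record holder; in Hudi the engine-specific record type T plays this role.
     record LogRecord(String key, long orderingValue, String payload) {}

     static Optional<LogRecord> merge(LogRecord incoming, LogRecord existing) {
       if (existing == null) {
         return Optional.of(incoming); // first record seen for this key
       }
       if (incoming.orderingValue() > existing.orderingValue()) {
         return Optional.of(incoming); // incoming is newer by event time
       }
       return Optional.empty(); // existing record wins; skip the update (and the seal)
     }

     public static void main(String[] args) {
       LogRecord existing = new LogRecord("k1", 5L, "v1");
       LogRecord incoming = new LogRecord("k1", 7L, "v2");
       System.out.println(merge(incoming, existing).isPresent()); // true: 7 > 5
     }
   }
   ```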



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


yihua commented on code in PR #9894:
URL: https://github.com/apache/hudi/pull/9894#discussion_r1372348212


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -1276,5 +1291,35 @@ public HoodieTableMetaClient initTable(Configuration configuration, String baseP
       throws IOException {
     return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
   }
+
+    private void validateMergeConfigs() {
+      boolean payloadClassNameSet = null != payloadClassName;
+      boolean payloadTypeSet = null != payloadType;
+      boolean recordMergerStrategySet = null != recordMergerStrategy;
+      boolean recordMergeModeSet = null != recordMergeMode;
+
+      checkArgument(recordMergeModeSet,
+          "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set");

Review Comment:
   This is mandatory in the table config. During table upgrade, the merge mode should be inferred from either the payload class name / type or the record merger strategy.
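
   A rough sketch of what that inference could look like (a hypothetical helper with assumed fallbacks; the real upgrade code path in Hudi may differ):
   ```
   // Hypothetical sketch: infer a record merge mode from legacy merge configs
   // during table upgrade. Names and fallback choices here are assumptions.
   enum RecordMergeMode { EVENT_TIME_ORDERING, OVERWRITE_WITH_LATEST, CUSTOM }

   final class MergeModeInference {

     static RecordMergeMode infer(String payloadClassName, String recordMergerStrategy) {
       if (recordMergerStrategy != null) {
         // A custom merger strategy implies custom merging semantics.
         return RecordMergeMode.CUSTOM;
       }
       if (payloadClassName != null) {
         if (payloadClassName.endsWith("OverwriteWithLatestAvroPayload")) {
           return RecordMergeMode.OVERWRITE_WITH_LATEST;
         }
         if (payloadClassName.endsWith("DefaultHoodieRecordPayload")) {
           // The default payload compares records on the precombine (event-time) field.
           return RecordMergeMode.EVENT_TIME_ORDERING;
         }
         return RecordMergeMode.CUSTOM;
       }
       // Nothing set: fall back to event-time ordering (an assumption, not a documented default).
       return RecordMergeMode.EVENT_TIME_ORDERING;
     }
   }
   ```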



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


codope commented on code in PR #9894:
URL: https://github.com/apache/hudi/pull/9894#discussion_r1372317061


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -1276,5 +1291,35 @@ public HoodieTableMetaClient initTable(Configuration configuration, String baseP
       throws IOException {
     return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
   }
+
+    private void validateMergeConfigs() {
+      boolean payloadClassNameSet = null != payloadClassName;
+      boolean payloadTypeSet = null != payloadType;
+      boolean recordMergerStrategySet = null != recordMergerStrategy;
+      boolean recordMergeModeSet = null != recordMergeMode;
+
+      checkArgument(recordMergeModeSet,
+          "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set");

Review Comment:
   Is it a mandatory config? How will it affect users upgrading to a new version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


codope commented on code in PR #9894:
URL: https://github.com/apache/hudi/pull/9894#discussion_r1372317061


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -1276,5 +1291,35 @@ public HoodieTableMetaClient initTable(Configuration configuration, String baseP
       throws IOException {
     return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
   }
+
+    private void validateMergeConfigs() {
+      boolean payloadClassNameSet = null != payloadClassName;
+      boolean payloadTypeSet = null != payloadType;
+      boolean recordMergerStrategySet = null != recordMergerStrategy;
+      boolean recordMergeModeSet = null != recordMergeMode;
+
+      checkArgument(recordMergeModeSet,
+          "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set");

Review Comment:
   Is it a mandatory config? How will it affect users upgrading to a new version?



##
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:
##
@@ -119,21 +124,36 @@ protected Option<T> doProcessNextDataRecord(T record,
          Map<String, Object> metadata,
          Pair<Option<T>, Map<String, Object>> existingRecordMetadataPair) throws IOException {
     if (existingRecordMetadataPair != null) {
-      // Merge and store the combined record
-      // Note that the incoming `record` is from an older commit, so it should be put as
-      // the `older` in the merge API
-      HoodieRecord<T> combinedRecord = (HoodieRecord<T>) recordMerger.merge(
-          readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema),
-          readerSchema,
-          readerContext.constructHoodieRecord(
-              existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema),
-          readerSchema,
-          payloadProps).get().getLeft();
-      // If pre-combine returns existing record, no need to update it
-      if (combinedRecord.getData() != existingRecordMetadataPair.getLeft().get()) {
-        return Option.of(combinedRecord.getData());
+      switch (recordMergeMode) {
+        case OVERWRITE_WITH_LATEST:
+          return Option.empty();
+        case EVENT_TIME_ORDERING:
+          Comparable incomingOrderingValue = readerContext.getOrderingValue(
+              Option.of(record), metadata, readerSchema, payloadProps);
+          Comparable existingOrderingValue = readerContext.getOrderingValue(
+              existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema, payloadProps);
+          if (incomingOrderingValue.compareTo(existingOrderingValue) > 0) {
+            return Option.of(record);
+          }
+          return Option.empty();

Review Comment:
   Why empty? Should it not be from `existingRecordMetadataPair`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780038520

   
   ## CI report:
   
   * 57481f626caf8864def8394c57316535fa490b90 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20490)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


yihua commented on code in PR #9883:
URL: https://github.com/apache/hudi/pull/9883#discussion_r1372307509


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:
##
@@ -195,6 +206,10 @@ public HoodieKey getKey() {
     return key;
   }
 
+  public boolean isPartial() {
+    return isPartial;

Review Comment:
   I removed all changes to `HoodieRecord` and its subclasses.  Now whether a record is partial is determined by the schema attached to it, which is per log file.  Checking whether a schema is partial also leverages a cache (see `SparkPartialMergingUtils`), so there is no overhead.
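
   As a simplified illustration of the caching idea (a hypothetical utility, with a schema stubbed as a set of field names rather than an Avro schema): the partial-or-not answer is computed once per distinct file schema and memoized, so per-record checks stay cheap.
   ```
   import java.util.Map;
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical sketch: decide once per distinct file schema whether it is
   // "partial" (missing some fields of the full table schema) and memoize the
   // answer, so repeated per-record checks are constant-time lookups.
   final class PartialSchemaCheck {

     private static final Map<Set<String>, Boolean> CACHE = new ConcurrentHashMap<>();

     static boolean isPartial(Set<String> fileSchemaFields, Set<String> tableSchemaFields) {
       return CACHE.computeIfAbsent(fileSchemaFields,
           fields -> !fields.containsAll(tableSchemaFields));
     }
   }
   ```
   The sketch keys the cache on the set of field names; the real utility caches on the schema object itself.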



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780026429

   
   ## CI report:
   
   * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469)
 
   * 57481f626caf8864def8394c57316535fa490b90 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]

2023-10-25 Thread via GitHub


yihua commented on code in PR #9883:
URL: https://github.com/apache/hudi/pull/9883#discussion_r1372304952


##
hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:
##
@@ -67,6 +70,7 @@ public abstract class HoodieReaderContext<T> {
    * file.
    *
    * @param filePath   {@link Path} instance of a file.
+   * @param isLogFile  Whether this is a log file.
    * @param start      Starting byte to start reading.

Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779951472

   
   ## CI report:
   
   * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20489)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779872587

   
   ## CI report:
   
   * 75e98fe81be61e02f30d41d798ea86b733a26e2a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20448)
 
   * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20489)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9761:
URL: https://github.com/apache/hudi/pull/9761#issuecomment-1779872241

   
   ## CI report:
   
   * 4ec731d4168128cc93e3be5d7f6c444aceacb970 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20484)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9894:
URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779858774

   
   ## CI report:
   
   * 75e98fe81be61e02f30d41d798ea86b733a26e2a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20448)
 
   * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779844736

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]

2023-10-25 Thread via GitHub


yihua merged PR #9912:
URL: https://github.com/apache/hudi/pull/9912


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Add table name and range msg for streaming reads logs (#9912)

2023-10-25 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 250456f3fba [MINOR] Add table name and range msg for streaming reads 
logs (#9912)
250456f3fba is described below

commit 250456f3fba70d35a0cc8445d143d187bd3abd7e
Author: zhuanshenbsj1 <34104400+zhuanshenb...@users.noreply.github.com>
AuthorDate: Thu Oct 26 02:06:24 2023 +0800

[MINOR] Add table name and range msg for streaming reads logs (#9912)
---
 .../main/java/org/apache/hudi/common/table/log/InstantRange.java | 9 +
 .../org/apache/hudi/source/StreamReadMonitoringFunction.java | 3 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
index 6609ad085ef..96c7b0c0ddf 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
@@ -57,6 +57,15 @@ public abstract class InstantRange implements Serializable {
 
   public abstract boolean isInRange(String instant);
 
+  @Override
+  public String toString() {
+return "InstantRange{"
++ "startInstant='" + startInstant == null ? "null" : startInstant + 
'\''
++ ", endInstant='" + endInstant == null ? "null" : endInstant + '\''
++ ", rangeType='" + this.getClass().getSimpleName() + '\''
++ '}';
+  }
+
   // -
   //  Inner Class
   // -
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
index 6f0fd9253e2..86e32fe5a0a 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
@@ -226,9 +226,10 @@ public class StreamReadMonitoringFunction
 this.issuedOffset = result.getOffset();
 LOG.info("\n"
 + "\n"
++ "-- table: {}\n"
 + "-- consumed to instant: {}\n"
 + "",
-this.issuedInstant);
+conf.getString(FlinkOptions.TABLE_NAME), this.issuedInstant);
   }
 
   @Override
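
One thing worth noting about the `toString()` hunk above: in Java, `+` binds tighter than `==`, so `"startInstant='" + startInstant == null` compares the concatenated string to null rather than the field, and the ternary never takes its "null" branch. A parenthesized form, shown here as a sketch rather than what was committed, would make the null check apply to the field:

```
// Sketch only (not the committed code): parentheses make the null check apply
// to the field itself; without them, + binds tighter than ==, so the check
// compares the already-concatenated string against null and is always false.
@Override
public String toString() {
  return "InstantRange{"
      + "startInstant='" + (startInstant == null ? "null" : startInstant) + '\''
      + ", endInstant='" + (endInstant == null ? "null" : endInstant) + '\''
      + ", rangeType='" + this.getClass().getSimpleName() + '\''
      + '}';
}
```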



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779778741

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779765638

   
   ## CI report:
   
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779765466

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]

2023-10-25 Thread via GitHub


fenil25 commented on issue #9915:
URL: https://github.com/apache/hudi/issues/9915#issuecomment-1779758399

   Got it. Thanks @ad1happy2go  
   Are bulk_insert and full_record bootstrap modes the same then? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779751539

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
 
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779751137

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463)
 
   * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Async Cleaner OOM / slowdown after creating a large Savepoint [hudi]

2023-10-25 Thread via GitHub


ehurheap commented on issue #9747:
URL: https://github.com/apache/hudi/issues/9747#issuecomment-1779720173

   Interesting. Thanks for the update @ad1happy2go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779680860

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
 
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   * 955944c19aa182a5231741fbf20888e517f6dafd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9743:
URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779680453

   
   ## CI report:
   
   * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN
   * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN
   * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463)
 
   * f98cbcb16737a88891703baeee15f5a6bd73e784 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9888:
URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779667511

   
   ## CI report:
   
   * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483)
 
   * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]

2023-10-25 Thread via GitHub


Armelabdelkbir commented on issue #9918:
URL: https://github.com/apache/hudi/issues/9918#issuecomment-1779660398

   Missing column: do you mean schema evolution? We do have schema evolution sometimes, but not for this use case.
   What is the impact of the upgrade on production? I have hundreds of tables and billions of rows. Do I just need to upgrade the Hudi version and keep the same metadata folders?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]

2023-10-25 Thread via GitHub


ad1happy2go commented on issue #9918:
URL: https://github.com/apache/hudi/issues/9918#issuecomment-1779633050

   @Armelabdelkbir I recommend you upgrade your Hudi version to 0.12.3, 0.13.1, or 0.14.0. It may happen due to a missing column in later records compared to previous ones. Do you have any such scenario?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Hudi 0.13.1 compatibility issues with EMR-6.7.0 and EMR-6.11.1 [hudi]

2023-10-25 Thread via GitHub


ad1happy2go commented on issue #9919:
URL: https://github.com/apache/hudi/issues/9919#issuecomment-1779622641

   @Shubham21k I think you are using the wrong utilities bundle jar. There are two utilities jars: hudi-utilities-bundle (which also contains the Hudi Spark bundle classes) and hudi-utilities-slim-bundle.
   
   Can you try using hudi-utilities-slim-bundle together with the spark3.3/spark3.2 bundle jar?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


