[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-05 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1269321846

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * f21eab07069aa87544e04b115e7463126cd9c472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12015)
   * 3dd9e31fa787ee2c4308bca9b2fe691566c51ec5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12022)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-05 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1269318141

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * f21eab07069aa87544e04b115e7463126cd9c472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12015)
   * 3dd9e31fa787ee2c4308bca9b2fe691566c51ec5 UNKNOWN
   
   



[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-05 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1269280015

   
   ## CI report:
   
   * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017)
   
   



[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-05 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1269274673

   
   ## CI report:
   
   * af8e58757bed12e53907076da02add1ba98b220c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12014)
   * 261adecadc91712a222905082cad122befe81566 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12021)
   
   



[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-05 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1269271854

   
   ## CI report:
   
   * af8e58757bed12e53907076da02add1ba98b220c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12014)
   * 261adecadc91712a222905082cad122befe81566 UNKNOWN
   
   



[jira] [Updated] (HUDI-4605) Upgrade hudi-presto-bundle version to 0.12.0

2022-10-05 Thread Raymond Xu (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4605:
-
Fix Version/s: 0.12.2

> Upgrade hudi-presto-bundle version to 0.12.0
> 
>
> Key: HUDI-4605
> URL: https://issues.apache.org/jira/browse/HUDI-4605
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.12.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6876: [MINOR] Handling null event time

2022-10-05 Thread GitBox


hudi-bot commented on PR #6876:
URL: https://github.com/apache/hudi/pull/6876#issuecomment-1269268926

   
   ## CI report:
   
   * 6ce255ff0537ecb4ecf9bf7cf7f2534f7021337b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12019)
   
   



[jira] [Updated] (HUDI-4522) [DOCS] Set presto session prop to use parquet column names in case of type mismatch

2022-10-05 Thread Raymond Xu (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4522:
-
Fix Version/s: 0.13.0
   (was: 0.12.0)

> [DOCS] Set presto session prop to use parquet column names in case of type 
> mismatch
> ---
>
> Key: HUDI-4522
> URL: https://issues.apache.org/jira/browse/HUDI-4522
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Léo Biscassi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> See https://github.com/apache/hudi/issues/6142





[GitHub] [hudi] hudi-bot commented on pull request #6862: [HUDI-4989] fixing deltastreamer init failures

2022-10-05 Thread GitBox


hudi-bot commented on PR #6862:
URL: https://github.com/apache/hudi/pull/6862#issuecomment-1269268848

   
   ## CI report:
   
   * 149aec6ea8ff6d895da07b0226be1efdf920e3d8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12016)
   
   



[jira] [Updated] (HUDI-4522) [DOCS] Set presto session prop to use parquet column names in case of type mismatch

2022-10-05 Thread Raymond Xu (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4522:
-
Fix Version/s: 0.12.0
   (was: 0.13.0)

> [DOCS] Set presto session prop to use parquet column names in case of type 
> mismatch
> ---
>
> Key: HUDI-4522
> URL: https://issues.apache.org/jira/browse/HUDI-4522
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Léo Biscassi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> See https://github.com/apache/hudi/issues/6142





[jira] [Updated] (HUDI-3210) [UMBRELLA] Native Presto connector for Hudi

2022-10-05 Thread Raymond Xu (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3210:
-
Summary: [UMBRELLA] Native Presto connector for Hudi  (was: [UMBRELLA] A new Presto connector for Hudi)

> [UMBRELLA] Native Presto connector for Hudi
> ---
>
> Key: HUDI-3210
> URL: https://issues.apache.org/jira/browse/HUDI-3210
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: trino-presto
>Reporter: Todd Gao
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0, 1.0.0
>
>
> This JIRA tracks all the tasks related to building a new Hudi connector in 
> Presto.





[jira] [Closed] (HUDI-4988) Add Docs regarding Hudi RecordMerger

2022-10-05 Thread Frank Wong (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wong closed HUDI-4988.

Resolution: Fixed

> Add Docs regarding Hudi RecordMerger
> 
>
> Key: HUDI-4988
> URL: https://issues.apache.org/jira/browse/HUDI-4988
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Frank Wong
>Priority: Critical
>
> We need to make sure that we're adding docs explaining
>  - RecordMerger component, its API and lifecycle
>  - Relationship w/ Merging Strategy 
>  - Its current limitations (and future evolution)





[jira] [Assigned] (HUDI-4988) Add Docs regarding Hudi RecordMerger

2022-10-05 Thread Frank Wong (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wong reassigned HUDI-4988:


Assignee: Frank Wong



[jira] [Assigned] (HUDI-3217) RFC-46: Optimize Record Payload handling

2022-10-05 Thread Frank Wong (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wong reassigned HUDI-3217:


Assignee: Alexey Kudinkin  (was: Frank Wong)

> RFC-46: Optimize Record Payload handling
> 
>
> Key: HUDI-3217
> URL: https://issues.apache.org/jira/browse/HUDI-3217
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: storage-management, writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.13.0
>
>
> Currently Hudi is biased toward an assumption of a particular payload representation 
> (Avro); long-term we would like to steer away from this and keep the record 
> payload completely opaque, so that
>  # We can keep record payload representation engine-specific
>  # Avoid unnecessary serde loops (Engine-specific > Avro > Engine-specific > 
> Binary)
> h2. *Proposal*
>  
> *Phase 2: Revisiting Record Handling*
> {_}T-shirt{_}: 2-2.5 weeks
> {_}Goal{_}: Avoid tight coupling with particular record representation on the 
> Read Path (currently Avro) and enable
>   * Revisit RecordPayload APIs
>  ** Deprecate {{getInsertValue}} and {{combineAndGetUpdateValue}} APIs 
> replacing w/ new “opaque” APIs (not returning Avro payloads)
>  ** Rebase RecordPayload hierarchy to be engine-specific:
>  *** Common engine-specific base abstracting common functionality (Spark, 
> Flink, Java)
>  *** Each feature-specific semantic will have to implement for all engines
>  ** Introduce new APIs
>  *** To access keys (record, partition)
>  *** To convert record to Avro (for BWC)
>  * Revisit RecordPayload handling
>  ** In WriteHandles 
>  *** API will be accepting opaque RecordPayload (no Avro conversion)
>  *** Can do (opaque) record merging if necessary
>  *** Passes RP as is to FileWriter
>  ** In FileWriters
>  *** Will accept RecordPayload interface
>  *** Should be engine-specific (to handle internal record representation)
>  ** In RecordReaders
>  *** API will be providing opaque RecordPayload (no Avro conversion)
>  
> REF
> [https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-6680]
>  





[jira] [Reopened] (HUDI-3217) RFC-46: Optimize Record Payload handling

2022-10-05 Thread Frank Wong (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wong reopened HUDI-3217:
--



[jira] [Updated] (HUDI-3217) RFC-46: Optimize Record Payload handling

2022-10-05 Thread Frank Wong (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Wong updated HUDI-3217:
-
Status: In Progress  (was: Reopened)



[hudi] branch master updated: [MINOR] Fix deploy script for flink 1.15 (#6872)

2022-10-05 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fd8a947e61 [MINOR] Fix deploy script for flink 1.15 (#6872)
fd8a947e61 is described below

commit fd8a947e6158ed848c7bb2efb272d833ae5c6442
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Oct 6 10:52:38 2022 +0800

[MINOR] Fix deploy script for flink 1.15 (#6872)
---
 scripts/release/deploy_staging_jars.sh | 2 +-
 scripts/release/validate_staged_bundles.sh | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/release/deploy_staging_jars.sh 
b/scripts/release/deploy_staging_jars.sh
index df4ac84efa..b6b035877e 100755
--- a/scripts/release/deploy_staging_jars.sh
+++ b/scripts/release/deploy_staging_jars.sh
@@ -43,7 +43,7 @@ declare -a ALL_VERSION_OPTS=(
 "-Dscala-2.11 -Dspark2.4 -Dflink1.13"
 "-Dscala-2.11 -Dspark2.4 -Dflink1.14"
 "-Dscala-2.12 -Dspark2.4 -Dflink1.13"
-"-Dscala-2.12 -Dspark3.3 -Dflink1.14"
+"-Dscala-2.12 -Dspark3.3 -Dflink1.15"
 "-Dscala-2.12 -Dspark3.2 -Dflink1.14"
 "-Dscala-2.12 -Dspark3.1 -Dflink1.14" # run this last to make sure utilities 
bundle has spark 3.1
 )
diff --git a/scripts/release/validate_staged_bundles.sh 
b/scripts/release/validate_staged_bundles.sh
index db99dcba12..baf506f944 100755
--- a/scripts/release/validate_staged_bundles.sh
+++ b/scripts/release/validate_staged_bundles.sh
@@ -34,6 +34,7 @@ declare -a BUNDLE_URLS=(
 
"${STAGING_REPO}/hudi-flink1.13-bundle_2.12/${VERSION}/hudi-flink1.13-bundle_2.12-${VERSION}.jar"
 
"${STAGING_REPO}/hudi-flink1.14-bundle_2.11/${VERSION}/hudi-flink1.14-bundle_2.11-${VERSION}.jar"
 
"${STAGING_REPO}/hudi-flink1.14-bundle_2.12/${VERSION}/hudi-flink1.14-bundle_2.12-${VERSION}.jar"
+"${STAGING_REPO}/hudi-flink1.15-bundle/${VERSION}/hudi-flink1.15-bundle-${VERSION}.jar"
 "${STAGING_REPO}/hudi-gcp-bundle/${VERSION}/hudi-gcp-bundle-${VERSION}.jar"
 
"${STAGING_REPO}/hudi-hadoop-mr-bundle/${VERSION}/hudi-hadoop-mr-bundle-${VERSION}.jar"
 
"${STAGING_REPO}/hudi-hive-sync-bundle/${VERSION}/hudi-hive-sync-bundle-${VERSION}.jar"



[GitHub] [hudi] xushiyan merged pull request #6872: [MINOR] Fix deploy script for flink 1.15

2022-10-05 Thread GitBox


xushiyan merged PR #6872:
URL: https://github.com/apache/hudi/pull/6872





[GitHub] [hudi] hudi-bot commented on pull request #6876: [MINOR] Handling null event time

2022-10-05 Thread GitBox


hudi-bot commented on PR #6876:
URL: https://github.com/apache/hudi/pull/6876#issuecomment-1269234032

   
   ## CI report:
   
   * 6ce255ff0537ecb4ecf9bf7cf7f2534f7021337b UNKNOWN
   
   



[jira] [Updated] (HUDI-4986) Enhance hudi integ test readme for multi-writer tests

2022-10-05 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4986:
-
Labels: pull-request-available  (was: )

> Enhance hudi integ test readme for multi-writer tests
> -
>
> Key: HUDI-4986
> URL: https://issues.apache.org/jira/browse/HUDI-4986
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs, tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[hudi] branch master updated: Enhancing README for multi-writer tests (#6870)

2022-10-05 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 280194d3b6 Enhancing README for multi-writer tests (#6870)
280194d3b6 is described below

commit 280194d3b6ef6e2181a137dd709f0c8e80d5de3a
Author: Sivabalan Narayanan 
AuthorDate: Wed Oct 5 19:41:52 2022 -0700

Enhancing README for multi-writer tests (#6870)
---
 hudi-integ-test/README.md | 52 ++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index 687ad9a2a9..bea9219294 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -525,7 +525,44 @@ Spark submit with the flag:
 ### Multi-writer tests
 Integ test framework also supports multi-writer tests. 
 
- Multi-writer tests with deltastreamer and a spark data source writer. 
+ Multi-writer tests with deltastreamer and a spark data source writer.
+
+Props of interest
+Top level configs:
+- --target-base-path refers to the target hudi table base path
+- --input-base-paths comma separated input paths. If you plan to spin up two 
writers, this should contain an input dir for each. 
+- --props-paths comma separated property file paths. Again, if you plan to 
spin up two writers, this should contain the property file for each writer. 
+- --workload-yaml-paths comma separated workload yaml files for each writer. 
+
+Configs in property file:
+- hoodie.deltastreamer.source.dfs.root : This property should refer to the input 
dir for each writer in its corresponding property file. 
+In other words, this should match w/ --input-base-paths. 
+- hoodie.deltastreamer.schemaprovider.target.schema.file : refers to the target 
schema. If you are running in docker, do copy the source avsc file to docker as 
well. 
+- hoodie.deltastreamer.schemaprovider.source.schema.file : refers to the source 
schema. Same as above (copy to docker if needed).
+
+We have sample properties files to use based on whether InProcessLockProvider 
is used or ZookeeperBasedLockProvider is used. 
+
+multi-writer-local-1.properties
+multi-writer-local-2.properties
+multi-writer-local-3.properties
+multi-writer-local-4.properties
+
+These have configs that use InProcessLockProvider. The config specific to 
InProcessLockProvider is:
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
+
+multi-writer-1.properties
+multi-writer-2.properties
+
+These have configs that use ZookeeperBasedLockProvider. Setting up zookeeper 
is outside of the scope of this README. Ensure 
+zookeeper is up before running these. Configs specific to 
ZookeeperBasedLockProvider: 
+
+hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
+hoodie.write.lock.zookeeper.url=zookeeper:2181
+hoodie.write.lock.zookeeper.port=2181
+hoodie.write.lock.zookeeper.lock_key=locks
+hoodie.write.lock.zookeeper.base_path=/tmp/.locks
+
+If you are running locally, ensure you update the schema file accordingly. 
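The options described above could be assembled into a single writer's property file. A hypothetical sketch of such a file for the InProcessLockProvider case follows; all paths are illustrative placeholders, not taken from this commit:

```properties
# Input dir for writer 1; must match the first entry of --input-base-paths
hoodie.deltastreamer.source.dfs.root=/tmp/hudi/input/writer-1
# Source/target schemas (copy the avsc into docker if running there)
hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/hudi/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/hudi/source.avsc
# Lock provider for single-JVM multi-writer runs
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
```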
 
 Sample spark-submit command to test one delta streamer and a spark data source 
writer. 
 ```shell
@@ -593,6 +630,19 @@ Sample spark-submit command to test one delta streamer and 
a spark data source w
 --use-hudi-data-to-generate-updates
 ```
 
+Properties that differ between the previous scenario and this one:
+--input-base-paths refers to 4 paths instead of 2
+--props-paths again, refers to 4 paths instead of 2.
+  -- Each property file will contain properties for one spark datasource 
writer. 
+--workload-yaml-paths refers to 4 paths instead of 2.
+  -- Each yaml file uses a different range of partitions so that there won't be 
any conflicts while doing concurrent writes.
+
+MOR Table: 
+Running multi-writer tests for COW works for the entire iteration, but w/ a MOR 
table, sometimes one of the writers could fail stating that 
+there is already a scheduled delta commit. In general, while scheduling 
compaction, there should not be inflight delta commits. 
+But w/ multiple threads ingesting at their own frequency, this is 
unavoidable. After a few iterations, one of your threads could 
+die because there is an inflight delta commit from another writer.
+
 ===
 ### Testing async table services
 We can test async table services with deltastreamer using below command. 3 
additional arguments are required to test async 



[GitHub] [hudi] codope merged pull request #6870: [HUDI-4986] Enhancing README for multi-writer tests

2022-10-05 Thread GitBox


codope merged PR #6870:
URL: https://github.com/apache/hudi/pull/6870





[hudi] branch master updated: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create (#6857)

2022-10-05 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fb4f026580 [HUDI-4970] Update kafka-connect readme and refactor 
HoodieConfig#create (#6857)
fb4f026580 is described below

commit fb4f02658050a74179338d4cfba07ceabe688c53
Author: Sagar Sumit 
AuthorDate: Thu Oct 6 08:11:35 2022 +0530

[HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create 
(#6857)
---
 .../apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java |  6 +++---
 .../org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java   |  6 +++---
 .../main/java/org/apache/hudi/common/config/HoodieConfig.java |  9 +
 hudi-kafka-connect/README.md  | 11 +++
 .../sql/hudi/procedure/TestUpgradeOrDowngradeProcedure.scala  |  5 +++--
 5 files changed, 17 insertions(+), 20 deletions(-)

diff --git 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
index ed4c952824..ff983d44ae 100644
--- 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
+++ 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
@@ -164,9 +164,9 @@ public class TestUpgradeDowngradeCommand extends 
CLIFunctionalTestHarness {
 Path propertyFile = new Path(metaClient.getMetaPath() + "/" + 
HoodieTableConfig.HOODIE_PROPERTIES_FILE);
 // Load the properties and verify
 FSDataInputStream fsDataInputStream = 
metaClient.getFs().open(propertyFile);
-HoodieConfig hoodieConfig = HoodieConfig.create(fsDataInputStream);
+HoodieConfig config = new HoodieConfig();
+config.getProps().load(fsDataInputStream);
 fsDataInputStream.close();
-assertEquals(Integer.toString(expectedVersion.versionCode()), hoodieConfig
-.getString(HoodieTableConfig.VERSION));
+assertEquals(Integer.toString(expectedVersion.versionCode()), 
config.getString(HoodieTableConfig.VERSION));
   }
 }
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java
index 39dbacabac..64ee23c35e 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java
@@ -770,9 +770,9 @@ public class TestUpgradeDowngrade extends 
HoodieClientTestBase {
 Path propertyFile = new Path(metaClient.getMetaPath() + "/" + 
HoodieTableConfig.HOODIE_PROPERTIES_FILE);
 // Load the properties and verify
 FSDataInputStream fsDataInputStream = 
metaClient.getFs().open(propertyFile);
-HoodieConfig hoodieConfig = HoodieConfig.create(fsDataInputStream);
+HoodieConfig config = new HoodieConfig();
+config.getProps().load(fsDataInputStream);
 fsDataInputStream.close();
-assertEquals(Integer.toString(expectedVersion.versionCode()), hoodieConfig
-.getString(HoodieTableConfig.VERSION));
+assertEquals(Integer.toString(expectedVersion.versionCode()), 
config.getString(HoodieTableConfig.VERSION));
   }
 }
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java 
b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
index 366d19fe6e..91f0671cf9 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java
@@ -18,15 +18,14 @@
 
 package org.apache.hudi.common.config;
 
-import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.exception.HoodieException;
+
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
-import java.io.IOException;
 import java.io.Serializable;
 import java.lang.reflect.Modifier;
 import java.util.Arrays;
@@ -42,12 +41,6 @@ public class HoodieConfig implements Serializable {
 
   protected static final String CONFIG_VALUES_DELIMITER = ",";
 
-  public static HoodieConfig create(FSDataInputStream inputStream) throws IOException {
-    HoodieConfig config = new HoodieConfig();
-    config.props.load(inputStream);
-    return config;
-  }
-
   protected TypedProperties props;
 
   public HoodieConfig() {
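The refactor above drops the static `HoodieConfig.create(FSDataInputStream)` factory in favor of constructing a `HoodieConfig` and loading the properties into it, as the updated test shows. A minimal self-contained sketch of that load pattern using plain `java.util.Properties` (the class and method names below are illustrative, not Hudi's):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class HoodieConfigSketch {

    // Mirror of the new pattern: build the object first, then load
    // properties from the already-open stream (the caller closes it).
    static Properties loadProps(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "hoodie.table.version=5\n".getBytes();
        Properties props = loadProps(new ByteArrayInputStream(content));
        System.out.println(props.getProperty("hoodie.table.version")); // prints 5
    }
}
```

The same two-step shape (construct, then `getProps().load(stream)`) is what the test diff exercises.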
diff --git a/hudi-kafka-connect/README.md b/hudi-kafka-connect/README.md
index 449236ea5c..a1d6f812c1 100644
--- a/hudi-kafka-connect/README.md
+++ b/hudi-kafka-connect/README.md
@@ -36,10 +36,10 @@ After installing these dependencies, follow steps based on your requirement.
 
 ### 1 - 

[GitHub] [hudi] codope merged pull request #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


codope merged PR #6857:
URL: https://github.com/apache/hudi/pull/6857


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on pull request #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


codope commented on PR #6857:
URL: https://github.com/apache/hudi/pull/6857#issuecomment-1269231905

   Landing it. Just a readme and test update.





[hudi] branch master updated: Revert "[HUDI-4915] improve avro serializer/deserializer (#6788)" (#6809)

2022-10-05 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 067cc24d88 Revert "[HUDI-4915] improve avro serializer/deserializer (#6788)" (#6809)
067cc24d88 is described below

commit 067cc24d88fd299f1dfc8b96a1995621799613d4
Author: Yann Byron 
AuthorDate: Thu Oct 6 10:40:55 2022 +0800

Revert "[HUDI-4915] improve avro serializer/deserializer (#6788)" (#6809)

This reverts commit 79b3e2b899cc303490c22610fda0e5ac2013cf02.
---
 .../org/apache/spark/sql/avro/AvroDeserializer.scala | 20 +---
 .../org/apache/spark/sql/avro/AvroSerializer.scala   | 17 +++--
 .../org/apache/spark/sql/avro/AvroDeserializer.scala | 20 +---
 .../org/apache/spark/sql/avro/AvroSerializer.scala   | 19 ---
 .../org/apache/spark/sql/avro/AvroDeserializer.scala | 20 +---
 .../org/apache/spark/sql/avro/AvroSerializer.scala   | 19 ---
 .../org/apache/spark/sql/avro/AvroDeserializer.scala | 20 +---
 .../org/apache/spark/sql/avro/AvroSerializer.scala   | 19 ---
 8 files changed, 99 insertions(+), 55 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index 921e6deb58..9725fb63f5 100644
--- a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
+++ b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
@@ -49,27 +49,33 @@ import scala.collection.mutable.ArrayBuffer
 class AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType) {
   private lazy val decimalConversions = new DecimalConversion()
 
-  def deserialize(data: Any): Any = rootCatalystType match {
+  private val converter: Any => Any = rootCatalystType match {
 // A shortcut for empty schema.
 case st: StructType if st.isEmpty =>
-  InternalRow.empty
+  (data: Any) => InternalRow.empty
 
 case st: StructType =>
   val resultRow = new SpecificInternalRow(st.map(_.dataType))
   val fieldUpdater = new RowUpdater(resultRow)
   val writer = getRecordWriter(rootAvroType, st, Nil)
-  val record = data.asInstanceOf[GenericRecord]
-  writer(fieldUpdater, record)
-  resultRow
+  (data: Any) => {
+val record = data.asInstanceOf[GenericRecord]
+writer(fieldUpdater, record)
+resultRow
+  }
 
 case _ =>
   val tmpRow = new SpecificInternalRow(Seq(rootCatalystType))
   val fieldUpdater = new RowUpdater(tmpRow)
   val writer = newWriter(rootAvroType, rootCatalystType, Nil)
-  writer(fieldUpdater, 0, data)
-  tmpRow.get(0, rootCatalystType)
+  (data: Any) => {
+writer(fieldUpdater, 0, data)
+tmpRow.get(0, rootCatalystType)
+  }
   }
 
+  def deserialize(data: Any): Any = converter(data)
+
   /**
    * Creates a writer to write avro values to Catalyst values at the given ordinal with the given
    * updater.
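The reverted change above had hoisted the per-call `rootCatalystType match` out of `deserialize` into a `converter` closure built once at construction time, so the hot path only invokes the prebuilt function. A hedged Java sketch of that closure-precomputation pattern, with simplified types rather than Spark's actual API:

```java
import java.util.function.Function;

public class ConverterSketch {

    // Choose the conversion strategy once, up front, instead of
    // re-dispatching on the target type for every record.
    static Function<Object, Object> buildConverter(Class<?> targetType) {
        if (targetType == Integer.class) {
            return data -> Integer.valueOf(data.toString());
        }
        // Fallback strategy: pass values through as strings.
        return data -> data.toString();
    }

    public static void main(String[] args) {
        Function<Object, Object> convert = buildConverter(Integer.class);
        // The hot loop only invokes the prebuilt closure.
        System.out.println(convert.apply("42")); // prints 42
    }
}
```

The trade-off the revert undoes: one extra field and a slightly less direct `deserialize`, in exchange for avoiding a type match per row.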
diff --git a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala
index e0c7344138..2b88be8165 100644
--- a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala
+++ b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala
@@ -47,6 +47,10 @@ import org.apache.spark.sql.types._
class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable: Boolean) {
 
   def serialize(catalystData: Any): Any = {
+converter.apply(catalystData)
+  }
+
+  private val converter: Any => Any = {
 val actualAvroType = resolveNullableType(rootAvroType, nullable)
 val baseConverter = rootCatalystType match {
   case st: StructType =>
@@ -59,13 +63,14 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
   converter.apply(tmpRow, 0)
 }
 if (nullable) {
-  if (catalystData == null) {
-null
-  } else {
-baseConverter.apply(catalystData)
-  }
+  (data: Any) =>
+if (data == null) {
+  null
+} else {
+  baseConverter.apply(data)
+}
 } else {
-  baseConverter.apply(catalystData)
+  baseConverter
 }
   }
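The serializer side of the revert shows the same idea applied to null handling: instead of checking `catalystData == null` on every `serialize` call, the nullable branch is decided once when the converter is built. A small illustrative Java sketch (names are mine, not Spark's):

```java
import java.util.function.Function;

public class NullableConverterSketch {

    // Bake the null check into the converter when the schema is nullable,
    // so the branch is selected once rather than once per record.
    static Function<Object, Object> wrapNullable(Function<Object, Object> base, boolean nullable) {
        if (nullable) {
            return data -> data == null ? null : base.apply(data);
        }
        // Non-nullable schemas skip the check entirely.
        return base;
    }

    public static void main(String[] args) {
        Function<Object, Object> convert = wrapNullable(o -> o.toString().toUpperCase(), true);
        System.out.println(convert.apply("ok"));  // prints OK
        System.out.println(convert.apply(null));  // prints null
    }
}
```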
 
diff --git a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index 61482ab96f..5fb6d907bd 100644
--- a/hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala

[GitHub] [hudi] xushiyan merged pull request #6809: Revert "[HUDI-4915] improve avro serializer/deserializer (#6788)"

2022-10-05 Thread GitBox


xushiyan merged PR #6809:
URL: https://github.com/apache/hudi/pull/6809





[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-05 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1269231396

   
   ## CI report:
   
   * 23d923e6b8c75781053f3f7bbc811084141f7786 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11978)
   * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan opened a new pull request, #6876: [MINOR] Handling null event time

2022-10-05 Thread GitBox


nsivabalan opened a new pull request, #6876:
URL: https://github.com/apache/hudi/pull/6876

   ### Change Logs
   
   Seeing noisy debug logs (`Fail to parse event time value`) with tests. Fixing null event time handling.
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-05 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1269228467

   
   ## CI report:
   
   * 23d923e6b8c75781053f3f7bbc811084141f7786 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11978)
   * e246d65957362860b850f1af9ef973b85bf1a4eb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-05 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1269225293

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * f21eab07069aa87544e04b115e7463126cd9c472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12015)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-05 Thread GitBox


nsivabalan commented on code in PR #6836:
URL: https://github.com/apache/hudi/pull/6836#discussion_r988495539


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -293,16 +293,20 @@ object HoodieFileIndex extends Logging {
     schema.fieldNames.filter { colName => refs.exists(r => resolver.apply(colName, r.name)) }
   }
 
-  def getConfigProperties(spark: SparkSession, options: Map[String, String]) = {
+  private def isFilesPartitionAvailable(metaClient: HoodieTableMetaClient): Boolean = {
+    metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_FILES)
+  }
+
+  def getConfigProperties(spark: SparkSession, options: Map[String, String], metaClient: HoodieTableMetaClient) = {
     val sqlConf: SQLConf = spark.sessionState.conf
     val properties = new TypedProperties()
 
     // To support metadata listing via Spark SQL we allow users to pass the config via SQL Conf in spark session. Users
     // would be able to run SET hoodie.metadata.enable=true in the spark sql session to enable metadata listing.
-    properties.setProperty(HoodieMetadataConfig.ENABLE.key(),
-      sqlConf.getConfString(HoodieMetadataConfig.ENABLE.key(),
-        HoodieMetadataConfig.DEFAULT_METADATA_ENABLE_FOR_READERS.toString))
-    properties.putAll(options.filter(p => p._2 != null).asJava)
+    val isMetadataFilesPartitionAvailable = isFilesPartitionAvailable(metaClient) && sqlConf.getConfString(HoodieMetadataConfig.ENABLE.key(),
Review Comment:
   Isn't this the entry point to metadata table on the read path? 






[GitHub] [hudi] hudi-bot commented on pull request #6862: [HUDI-4989] fixing deltastreamer init failures

2022-10-05 Thread GitBox


hudi-bot commented on PR #6862:
URL: https://github.com/apache/hudi/pull/6862#issuecomment-1269183442

   
   ## CI report:
   
   * 149aec6ea8ff6d895da07b0226be1efdf920e3d8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12016)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4989) Deltastreamer fails if table instantiation failed mid-way in prior attempt

2022-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4989:
-
Labels: pull-request-available  (was: )

> Deltastreamer fails if table instantiation failed mid-way in prior attempt
> --
>
> Key: HUDI-4989
> URL: https://issues.apache.org/jira/browse/HUDI-4989
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> If table instantiation failed mid-way, and if deltastreamer is restarted, it 
> could fail if hoodie.properties does not exist. 
>  
> we need to make it resilient in such cases. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6862: [HUDI-4989] fixing deltastreamer init failures

2022-10-05 Thread GitBox


hudi-bot commented on PR #6862:
URL: https://github.com/apache/hudi/pull/6862#issuecomment-1269180039

   
   ## CI report:
   
   * 149aec6ea8ff6d895da07b0226be1efdf920e3d8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4989) Deltastreamer fails if table instantiation failed mid-way in prior attempt

2022-10-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4989:
--
Priority: Critical  (was: Major)

> Deltastreamer fails if table instantiation failed mid-way in prior attempt
> --
>
> Key: HUDI-4989
> URL: https://issues.apache.org/jira/browse/HUDI-4989
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.13.0
>
>
> If table instantiation failed mid-way, and if deltastreamer is restarted, it 
> could fail if hoodie.properties does not exist. 
>  
> we need to make it resilient in such cases. 





[jira] [Created] (HUDI-4989) Deltastreamer fails if table instantiation failed mid-way in prior attempt

2022-10-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-4989:
-

 Summary: Deltastreamer fails if table instantiation failed mid-way 
in prior attempt
 Key: HUDI-4989
 URL: https://issues.apache.org/jira/browse/HUDI-4989
 Project: Apache Hudi
  Issue Type: Improvement
  Components: deltastreamer
Reporter: sivabalan narayanan


If table instantiation failed mid-way, and if deltastreamer is restarted, it 
could fail if hoodie.properties does not exist. 

 

we need to make it resilient in such cases. 





[jira] [Updated] (HUDI-4989) Deltastreamer fails if table instantiation failed mid-way in prior attempt

2022-10-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4989:
--
Fix Version/s: 0.13.0

> Deltastreamer fails if table instantiation failed mid-way in prior attempt
> --
>
> Key: HUDI-4989
> URL: https://issues.apache.org/jira/browse/HUDI-4989
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.13.0
>
>
> If table instantiation failed mid-way, and if deltastreamer is restarted, it 
> could fail if hoodie.properties does not exist. 
>  
> we need to make it resilient in such cases. 





[jira] [Assigned] (HUDI-4989) Deltastreamer fails if table instantiation failed mid-way in prior attempt

2022-10-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4989:
-

Assignee: sivabalan narayanan

> Deltastreamer fails if table instantiation failed mid-way in prior attempt
> --
>
> Key: HUDI-4989
> URL: https://issues.apache.org/jira/browse/HUDI-4989
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> If table instantiation failed mid-way, and if deltastreamer is restarted, it 
> could fail if hoodie.properties does not exist. 
>  
> we need to make it resilient in such cases. 





[GitHub] [hudi] yihua commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-05 Thread GitBox


yihua commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r988454552


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,238 @@
+
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+## Abstract
+
+At present, Hudi implements an OCC (Optimistic Concurrency Control) based on the timeline to ensure data consistency,
+integrity and correctness between multi-writers. OCC detects conflicts at Hudi's file group level, i.e., two
+concurrent writers updating the same file group are detected as a conflict. Currently, the conflict detection is
+performed before committing metadata and after the data writing is completed. If any conflict is detected, it leads
+to a waste of cluster resources, because the computation and writing have already finished.
+
+To solve this problem, this RFC proposes an early conflict detection mechanism to detect conflicts during the data
+writing phase and abort the write early if a conflict is detected, using Hudi's marker mechanism. Before writing each
+data file, the writer creates a corresponding marker to mark that the file is created, so that the writer can use the
+markers to automatically clean up uncommitted data in failure and rollback scenarios. We propose to use the markers to
+identify conflicts at the file group level while writing data. There are some subtle differences in the early conflict
+detection workflow between the different types of marker maintainers. For direct markers, Hudi lists the necessary
+marker files directly and checks for conflicts before the writers create markers and start writing the corresponding
+data files. For timeline-server-based markers, Hudi fetches the result of the marker conflict check before the writers
+create markers and start writing the corresponding data files. The conflicts are checked asynchronously and
+periodically so that writing conflicts can be detected as early as possible. Both writers may still write data files
+of the same file slice until the conflict is detected in the next round of checking.
+
+Moreover, Hudi can stop writing earlier thanks to early conflict detection and release resources back to the cluster,
+improving resource utilization.
+
+Note that the early conflict detection proposed by this RFC operates within OCC. Any conflict detection outside the
+scope of OCC is not handled. For example, the current OCC for multiple writers cannot detect a conflict where two
+concurrent writers perform INSERT operations for the same set of record keys, because the writers write to different
+file groups. This RFC does not intend to address this problem.
+
+## Background
+
+As we know, transactions and multi-writers of data lakes are becoming the key 
characteristics of building Lakehouse
+these days. Quoting the inspiring blog "Lakehouse Concurrency Control: Are we too optimistic?" directly:
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+> "Hudi implements a file level, log based concurrency control protocol on the 
Hudi timeline, which in-turn relies
+> on bare minimum atomic puts to cloud storage. By building on an event log as 
the central piece for inter process
+> coordination, Hudi is able to offer a few flexible deployment models that 
offer greater concurrency over pure OCC
+> approaches that just track table snapshots."
+
+In the multi-writer scenario, Hudi's existing conflict detection occurs after the writer finishes writing the data and
+before it commits the metadata. In other words, the writer only detects the conflict when it starts to commit,
+although all computation and data writing have already completed, which wastes resources.
+
+For example:
+
+Now there are two writing jobs: job1 writes 10M of data to the Hudi table, including updates to file group 1. Another
+job, job2, writes 100G to the Hudi table and also updates the same file group 1.
+
+Job1 finishes and commits to Hudi successfully. After a few hours, job2 finishes writing its data files (100G) and
+starts to commit metadata. At this time, a conflict with job1 is found, and job2 has to be aborted and re-run after
+the failure. Obviously, a lot of computing resources and time are wasted on job2.
+
+Hudi currently has two important mechanisms, the marker mechanism and the heartbeat mechanism:
+
+1. The marker mechanism can track all the files that are part of an active write.
+2. The heartbeat mechanism can track all active writers to a Hudi table.
+
+Based on marker and heartbeat, this RFC proposes a new conflict detection: 
Early Conflict Detection. Before the writer
+creates the marker and before it starts to write the file, Hudi performs this 
new conflict detection, trying to detect
+the writing conflict directly (for direct markers) or get the async conflict 
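The file-group-level check the RFC describes can be pictured as a set intersection over the file groups named by each writer's markers. A deliberately simplified, hypothetical sketch, not Hudi's actual marker API:

```java
import java.util.HashSet;
import java.util.Set;

public class MarkerConflictSketch {

    // Two concurrent writers conflict when the file groups referenced by
    // their markers overlap. (Illustrative only; Hudi's real check also
    // consults heartbeats and completed/pending instants.)
    static boolean hasConflict(Set<String> myFileGroups, Set<String> otherFileGroups) {
        Set<String> overlap = new HashSet<>(myFileGroups);
        overlap.retainAll(otherFileGroups);
        return !overlap.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> job1 = Set.of("file-group-1", "file-group-2");
        Set<String> job2 = Set.of("file-group-1", "file-group-9");
        // Detected before job2 writes its 100G of data, not at commit time.
        System.out.println(hasConflict(job1, job2)); // prints true
    }
}
```

Running this check before marker creation, as the RFC proposes, is what lets the losing writer abort before the expensive write instead of at commit time.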

[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-05 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1269130978

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 91655a8009d60f6337939f87d6e2e01922877848 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12002)
 
   * f21eab07069aa87544e04b115e7463126cd9c472 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12015)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-05 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1269127142

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 91655a8009d60f6337939f87d6e2e01922877848 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12002)
   * f21eab07069aa87544e04b115e7463126cd9c472 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-05 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1269065121

   
   ## CI report:
   
   * af8e58757bed12e53907076da02add1ba98b220c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12014)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-05 Thread GitBox


the-other-tim-brown commented on code in PR #6806:
URL: https://github.com/apache/hudi/pull/6806#discussion_r988360500


##
hudi-utilities/pom.xml:
##
@@ -85,7 +85,6 @@
 
   com.google.protobuf
   protobuf-java-util
-  test

Review Comment:
   Should I revert that last commit then?






[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-05 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1268931103

   
   ## CI report:
   
   * 7525a09b2415fbf4e5e7de7c71cfffd8afc8c410 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12011)
   * af8e58757bed12e53907076da02add1ba98b220c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12014)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46

2022-10-05 Thread GitBox


hudi-bot commented on PR #6745:
URL: https://github.com/apache/hudi/pull/6745#issuecomment-1268925125

   
   ## CI report:
   
   * 7525a09b2415fbf4e5e7de7c71cfffd8afc8c410 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12011)
   * af8e58757bed12e53907076da02add1ba98b220c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-05 Thread GitBox


hudi-bot commented on PR #6806:
URL: https://github.com/apache/hudi/pull/6806#issuecomment-1268918966

   
   ## CI report:
   
   * f03f9610cf4e2c490d33ca734ca9b3241b2be778 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12012)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6858: [HUDI-4824]RANGE BUCKET index, base logic and test.

2022-10-05 Thread GitBox


hudi-bot commented on PR #6858:
URL: https://github.com/apache/hudi/pull/6858#issuecomment-1265746656

   
   ## CI report:
   
   * 76d14eb325d62a026248cc5c30de0d415e0c92a2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11975)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6858: [HUDI-4824]RANGE BUCKET index, base logic and test.

2022-10-05 Thread GitBox


hudi-bot commented on PR #6858:
URL: https://github.com/apache/hudi/pull/6858#issuecomment-1265741058

   
   ## CI report:
   
   * 76d14eb325d62a026248cc5c30de0d415e0c92a2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] wqwl611 commented on pull request #6636: [HUDI-4824]Add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

2022-10-05 Thread GitBox


wqwl611 commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1265711269

   I accidentally deleted my previous fork, so I opened a new PR and modified 
the code according to the previous review requirements. Please help me review 
it, thanks.
   @danny0405 @YuweiXiao @xushiyan @alexeykudinkin 
   
   https://github.com/apache/hudi/pull/6858





[GitHub] [hudi] wqwl611 commented on pull request #6858: [HUDI-4824]RANGE Bucket index, base logic and test.

2022-10-05 Thread GitBox


wqwl611 commented on PR #6858:
URL: https://github.com/apache/hudi/pull/6858#issuecomment-1265709727

   I accidentally deleted my previous fork, so I opened a new PR and modified 
the code according to the previous review requirements. Please help me review 
it, thanks.
   @danny0405 @YuweiXiao @xushiyan @alexeykudinkin 





[GitHub] [hudi] wqwl611 opened a new pull request, #6858: [HUDI-4824]RANGE Bucket index, base logic and test.

2022-10-05 Thread GitBox


wqwl611 opened a new pull request, #6858:
URL: https://github.com/apache/hudi/pull/6858

   ### Change Logs
   The RANGE_BUCKET index is mainly used when syncing MySQL tables to Hudi in 
near real time; it avoids the drawback of the fixed number of buckets in the 
SIMPLE bucket index.
   
   Usually, a MySQL table has an auto-increment id primary key field. In a 
MySQL CDC synchronization scenario, we can use the database name and table 
name as the Hudi partition field, and id as the primary key field of the Hudi 
table. This also handles sub-database and sub-table (sharded) setups.
   
   To achieve better sync performance, we usually use a bucket index. But with 
the SIMPLE bucket index, because the number of buckets is fixed, it is 
difficult to determine a suitable bucket count, and as the table grows, the 
previous number of buckets will no longer be appropriate.
   
   So I propose RANGE_BUCKET: in the SIMPLE bucket index, the bucket number is 
(hash % bucketNum), whereas in RANGE_BUCKET we use (id / fixedStep) to 
determine the bucket number, so the number of buckets increases as the id 
grows. For example, if step = 10 is set, then, because the id is 
auto-incrementing, a new bucket is generated for every 10 records.
   
   In a real scenario, I set step = 1,000,000. Since MySQL records are usually 
of similar size, the approximate size of each bucket will be 50M ~ 350M, which 
avoids the drawback of the fixed number of buckets in the SIMPLE bucket index.
   
   
   
   ### Impact
   
   Introduces a new index RANGE_BUCKET; users can use it as follows:
   
 option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
 option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
 option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
 option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
   
   **Risk level: none | low | medium | high**
   
   low
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


hudi-bot commented on PR #6857:
URL: https://github.com/apache/hudi/pull/6857#issuecomment-1265637355

   
   ## CI report:
   
   * d4f9276e7a3802fb2df5c3b7e28c224e4a1e7f15 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11974)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope commented on issue #6332: Avoid creating Configuration copies in Hudi

2022-10-05 Thread GitBox


codope commented on issue #6332:
URL: https://github.com/apache/hudi/issues/6332#issuecomment-1265613884

   Synced up with @pratyakshsharma regarding this issue. First of all, the 
issue affects queries on Hudi tables via the presto-hive connector. We need to 
see if we can use the config provided by the engine itself while instantiating 
the meta client. Created HUDI-4974 to track that issue. For now, we have a 
mitigation: we will unwrap the wrapper config object and pass that. 
   
   Closing the issue as it has been triaged and we have an interim solution.





[GitHub] [hudi] codope closed issue #6332: Avoid creating Configuration copies in Hudi

2022-10-05 Thread GitBox


codope closed issue #6332: Avoid creating Configuration copies in Hudi
URL: https://github.com/apache/hudi/issues/6332





[jira] [Updated] (HUDI-4970) hudi-kafka-connect-bundle: Could not initialize class org.apache.hadoop.security.UserGroupInformation

2022-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4970:
-
Labels: pull-request-available  (was: )

> hudi-kafka-connect-bundle: Could not initialize class 
> org.apache.hadoop.security.UserGroupInformation
> -
>
> Key: HUDI-4970
> URL: https://issues.apache.org/jira/browse/HUDI-4970
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> The Kafka connect sink loads successfully but fails to sync Hudi table due to 
> NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.security.UserGroupInformation
> {code:java}
> [2022-10-03 14:31:49,872] INFO The value of 
> hoodie.datasource.write.keygenerator.type is empty, using SIMPLE 
> (org.apache.hudi.keygen.factory.HoodieAvroKeyGeneratorFactory:63)[2022-10-03 
> 14:31:49,872] INFO Setting record key volume and partition fields date for 
> table file:///tmp/hoodie/hudi-test-topichudi-test-topic 
> (org.apache.hudi.connect.writers.KafkaConnectTransactionServices:93)[2022-10-03
>  14:31:49,872] INFO Initializing file:///tmp/hoodie/hudi-test-topic as hoodie 
> table file:///tmp/hoodie/hudi-test-topic 
> (org.apache.hudi.common.table.HoodieTableMetaClient:424)[2022-10-03 
> 14:31:49,872] INFO Existing partitions deleted [hudi-test-topic-0] 
> (org.apache.hudi.connect.HoodieSinkTask:156)[2022-10-03 14:31:49,872] ERROR 
> WorkerSinkTask{id=hudi-sink-3} Task threw an uncaught and unrecoverable 
> exception. Task is being killed and will not recover until manually restarted 
> (org.apache.kafka.connect.runtime.WorkerTask:184)java.lang.NoClassDefFoundError:
>  Could not initialize class org.apache.hadoop.security.UserGroupInformation   
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3431) 
>   at 
> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3421)   
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3263)  at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)   at 
> org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:110)at 
> org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:103)at 
> org.apache.hudi.common.table.HoodieTableMetaClient.initTableAndGetMetaClient(HoodieTableMetaClient.java:426)
>  at 
> org.apache.hudi.common.table.HoodieTableMetaClient$PropertyBuilder.initTable(HoodieTableMetaClient.java:1110)
> at 
> org.apache.hudi.connect.writers.KafkaConnectTransactionServices.<init>(KafkaConnectTransactionServices.java:104)
>  at 
> org.apache.hudi.connect.transaction.ConnectTransactionCoordinator.<init>(ConnectTransactionCoordinator.java:88)
>   at 
> org.apache.hudi.connect.HoodieSinkTask.bootstrap(HoodieSinkTask.java:191)
> at org.apache.hudi.connect.HoodieSinkTask.open(HoodieSinkTask.java:151) at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:635)
>   at 
> org.apache.kafka.connect.runtime.WorkerSinkTask.access$1000(WorkerSinkTask.java:71){code}
> Follow [https://github.com/apache/hudi/tree/master/hudi-kafka-connect#readme] 
> to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


hudi-bot commented on PR #6857:
URL: https://github.com/apache/hudi/pull/6857#issuecomment-1265411902

   
   ## CI report:
   
   * d4f9276e7a3802fb2df5c3b7e28c224e4a1e7f15 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11974)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


hudi-bot commented on PR #6857:
URL: https://github.com/apache/hudi/pull/6857#issuecomment-1265405407

   
   ## CI report:
   
   * d4f9276e7a3802fb2df5c3b7e28c224e4a1e7f15 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-05 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1265398895

   
   ## CI report:
   
   * 4b1c2e6a4a256989d070a105cdd88ef02aaa8fc1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11972)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope opened a new pull request, #6857: [HUDI-4970] Update kafka-connect readme and refactor HoodieConfig#create

2022-10-05 Thread GitBox


codope opened a new pull request, #6857:
URL: https://github.com/apache/hudi/pull/6857

   ### Change Logs
   
   Update the Kafka Connect setup with some details. Also, remove 
`HoodieConfig#create`, which is used only in tests and unnecessarily forces 
the class loader to load `org.apache.hadoop.fs.FSDataInputStream`.
   
   ### Impact
   
   Refactoring does not have any impact in this case.
   
   **Risk level: none**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1265314521

   
   ## CI report:
   
   * 3b01f5fd8a8be1d5b7dfca7adc882771f7fa787d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11970)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope merged pull request #6846: [HUDI-4962] Move cloud dependencies to cloud modules

2022-10-05 Thread GitBox


codope merged PR #6846:
URL: https://github.com/apache/hudi/pull/6846





[GitHub] [hudi] pramodbiligiri commented on pull request #6846: [HUDI-4962] Move cloud dependencies to cloud modules

2022-10-05 Thread GitBox


pramodbiligiri commented on PR #6846:
URL: https://github.com/apache/hudi/pull/6846#issuecomment-1265279684

   Tested that this build works locally. Built this branch as follows and ran 
the sync; results after the sync are pasted below.
   Build:
   $ mvn -DskipTests -Dspark3.2 -Dscala-2.12 -Dcheckstyle.skip -Drat.skip clean 
install
   
   Testing the sync (each file has 10 records):
   ```
   Before Sync:
   select max(id) from gcs_data;
   
   ++
   |  _c0   |
   ++
   | 20370  |
   ++
   
   select max(name) from gcs_meta_hive;
   +--+
   |   _c0|
   +--+
   | country=IN/data-file-2040.jsonl  |
   +--+
   
   
   After Sync:
   select max(id) from gcs_data;
   ++
   |  _c0   |
   ++
   | 20450  |
   ++
   
   select max(name) from gcs_meta_hive;
   +--+
   |   _c0|
   +--+
   | country=IN/data-file-2045.jsonl  |
   +--+
   ```





[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-05 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1265246778

   
   ## CI report:
   
   * 13a464e9dca3394ed7d946c0e682ad02f7edfc43 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11971)
 
   * 4b1c2e6a4a256989d070a105cdd88ef02aaa8fc1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11972)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

2022-10-05 Thread GitBox


hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265241676

   
   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] dragonH commented on issue #6832: [SUPPORT] AWS Glue 3.0 fail to write dataset with hudi (hive sync issue)

2022-10-05 Thread GitBox


dragonH commented on issue #6832:
URL: https://github.com/apache/hudi/issues/6832#issuecomment-1265226296

   @kazdy thanks for the information!
   
   I wonder if there's a better way to avoid this kind of issue, e.g.:
   
   - add a new config property `AWSGlueDataCatalogEnabled`, and if it is set 
to `True`, convert the table name to lowercase from the start and throughout
   
   - or edit the table-name-related config property hint to highlight this 
(if using the AWS Glue Data Catalog, the table name should be lowercase)
   
   because the original config description and the exception raised don't 
really point to this
   
   thanks for your help!
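   The first suggestion could look roughly like the following. This is a 
hypothetical sketch only; `glue_catalog_enabled` is an illustrative flag, not 
a real Hudi config key:

```python
def normalize_table_name(table_name: str, glue_catalog_enabled: bool) -> str:
    # When syncing to the AWS Glue Data Catalog, coerce the table name to
    # lowercase up front so hive-sync does not fail later with an opaque error.
    if glue_catalog_enabled:
        return table_name.lower()
    return table_name

print(normalize_table_name("MyTable", True))   # → mytable
print(normalize_table_name("MyTable", False))  # → MyTable
```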





[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-05 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1265171581

   
   ## CI report:
   
   * 13a464e9dca3394ed7d946c0e682ad02f7edfc43 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11971)
 
   * 4b1c2e6a4a256989d070a105cdd88ef02aaa8fc1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1265165246

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11948)
 
   * 3b01f5fd8a8be1d5b7dfca7adc882771f7fa787d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11970)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1265159045

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11948)
 
   * 3b01f5fd8a8be1d5b7dfca7adc882771f7fa787d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-05 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1265159167

   
   ## CI report:
   
   * e3aef767db19eed24222f8fff89ae4c59d0799c2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11956)
 
   * 13a464e9dca3394ed7d946c0e682ad02f7edfc43 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6850: [Draft][HUDI-4964] inline all the getter methods that have no logic …

2022-10-05 Thread GitBox


hudi-bot commented on PR #6850:
URL: https://github.com/apache/hudi/pull/6850#issuecomment-1265165358

   
   ## CI report:
   
   * e3aef767db19eed24222f8fff89ae4c59d0799c2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11956)
 
   * 13a464e9dca3394ed7d946c0e682ad02f7edfc43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11971)
 
   * 4b1c2e6a4a256989d070a105cdd88ef02aaa8fc1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

2022-10-05 Thread GitBox


hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265078737

   
   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] voonhous opened a new pull request, #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

2022-10-05 Thread GitBox


voonhous opened a new pull request, #6856:
URL: https://github.com/apache/hudi/pull/6856

   ### Change Logs
   
   Update misleading `read.streaming.skip_compaction` config. 
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hkszn commented on issue #6718: [SUPPORT] Deltastreamer concurrent writes in continuous mode

2022-10-05 Thread GitBox


hkszn commented on issue #6718:
URL: https://github.com/apache/hudi/issues/6718#issuecomment-1265049923

   Thank you for your reply.
   
   > If you are interested, I can guide you on how to achieve this.
   
   Yes, I would like to try it.





[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

2022-10-05 Thread GitBox


hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265072776

   
   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] kazdy commented on issue #6832: [SUPPORT] AWS Glue 3.0 fail to write dataset with hudi (hive sync issue)

2022-10-05 Thread GitBox


kazdy commented on issue #6832:
URL: https://github.com/apache/hudi/issues/6832#issuecomment-1265036326

   Btw, on EMR you'll get the same error because the Glue client is the same. 
I got this error when running Hudi on EMR.





[jira] [Updated] (HUDI-4968) Fix ambiguous stream read config

2022-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4968:
-
Labels: pull-request-available  (was: )

> Fix ambiguous stream read config
> 
>
> Key: HUDI-4968
> URL: https://issues.apache.org/jira/browse/HUDI-4968
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: voon
>Priority: Major
>  Labels: pull-request-available
>
> Fix ambiguous stream read configs by:
>  # Updating the relevant versioned (versions >= 0.11.0) markdown pages to 
> change _read.streaming.start-commit_ to _read.start-commit_
>  # Updating the _read.streaming.skip_compaction_ description to accurately describe when it should be used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1265000971

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] voonhous opened a new pull request, #6855: [HUDI-4968] Update old config keys

2022-10-05 Thread GitBox


voonhous opened a new pull request, #6855:
URL: https://github.com/apache/hudi/pull/6855

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] xushiyan commented on pull request #6852: [MINOR] Fix testUpdateRejectForClustering

2022-10-05 Thread GitBox


xushiyan commented on PR #6852:
URL: https://github.com/apache/hudi/pull/6852#issuecomment-1264955767

   Test fix works. CI failure is irrelevant. Landing this.





[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1264952448

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan commented on a diff in pull request #6732: [HUDI-4148] Add client for hudi table management service

2022-10-05 Thread GitBox


xushiyan commented on code in PR #6732:
URL: https://github.com/apache/hudi/pull/6732#discussion_r985359891


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1052,13 +875,29 @@ public void dropIndex(List 
partitionTypes) {
 }
   }
 
+  /**
+   * Performs Clustering for the workload stored in instant-time.
+   *
+   * @param clusteringInstantTime Clustering Instant Time
+   * @return Collection of WriteStatus to inspect errors and counts
+   */
+  public HoodieWriteMetadata cluster(String clusteringInstantTime) {
+if (delegateToTableManagerService(config, ActionType.replacecommit)) {
+  throw new HoodieException(ActionType.replacecommit.name() + " delegate 
to table management service!");

Review Comment:
   please align on the name



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseTableServiceClient.java:
##
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieClusteringPlan;
+import org.apache.hudi.avro.model.HoodieCompactionPlan;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.ActionType;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.model.TableServiceType;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieCommitException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.metrics.HoodieMetrics;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.HoodieWriteMetadata;
+
+import com.codahale.metrics.Timer;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.List;
+import java.util.Map;
+
+public abstract class BaseTableServiceClient extends CommonHoodieClient {

Review Comment:
   BaseHoodieTableServiceClient to align with BaseHoodieWriteClient



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/table/manager/HoodieTableManagerClient.java:
##
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.table.manager;
+
+import org.apache.hudi.common.config.HoodieTableManagerConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.exception.HoodieRemoteException;
+

[GitHub] [hudi] xushiyan merged pull request #6852: [MINOR] Fix testUpdateRejectForClustering

2022-10-05 Thread GitBox


xushiyan merged PR #6852:
URL: https://github.com/apache/hudi/pull/6852





[GitHub] [hudi] hudi-bot commented on pull request #6838: [MINOR] Update azure image and balance CI jobs

2022-10-05 Thread GitBox


hudi-bot commented on PR #6838:
URL: https://github.com/apache/hudi/pull/6838#issuecomment-1264955736

   
   ## CI report:
   
   * b4875afb16a2a8bdd0bce03f518af4fee9ada2a7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11948)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan commented on issue #6281: [SUPPORT] AwsGlueCatalogSyncTool -The number of partition keys do not match the number of partition values

2022-10-05 Thread GitBox


xushiyan commented on issue #6281:
URL: https://github.com/apache/hudi/issues/6281#issuecomment-1264943872

   @crutis closing this as explained by @yihua. Let us know how it works.





[GitHub] [hudi] xushiyan closed issue #6281: [SUPPORT] AwsGlueCatalogSyncTool -The number of partition keys do not match the number of partition values

2022-10-05 Thread GitBox


xushiyan closed issue #6281: [SUPPORT] AwsGlueCatalogSyncTool -The number of 
partition keys do not match the number of partition values 
URL: https://github.com/apache/hudi/issues/6281





[GitHub] [hudi] xushiyan merged pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

2022-10-05 Thread GitBox


xushiyan merged PR #6851:
URL: https://github.com/apache/hudi/pull/6851





[GitHub] [hudi] xushiyan commented on pull request #6732: [HUDI-4148] Add client for hudi table management service

2022-10-05 Thread GitBox


xushiyan commented on PR #6732:
URL: https://github.com/apache/hudi/pull/6732#issuecomment-1264932668

   @yuzhaojing please also fill up the PR description properly. as discussed, a 
class diagram to show the new hierarchy expedites the review.





[GitHub] [hudi] zhangyue19921010 commented on a diff in pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-05 Thread GitBox


zhangyue19921010 commented on code in PR #6003:
URL: https://github.com/apache/hudi/pull/6003#discussion_r985349298


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,238 @@
+
+
+# RFC-56: Early Conflict Detection For Multi-writer
+
+## Proposers
+
+- @zhangyue19921010
+
+## Approvers
+
+- @yihua
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-1575
+
+## Abstract
+
+At present, Hudi implements an OCC (Optimistic Concurrency Control) based on 
timeline to ensure data consistency,
+integrity and correctness between multi-writers. OCC detects the conflict at 
Hudi's file group level, i.e., two
+concurrent writers updating the same file group are detected as a conflict. 
Currently, the conflict detection is
+performed after the data writing is completed and before the metadata is committed. If any conflict is detected, it leads to a waste of cluster resources because the computation and writing have already finished.
+
+To solve this problem, this RFC proposes an early conflict detection mechanism 
to detect the conflict during the data
+writing phase and abort the writing early if conflict is detected, using 
Hudi's marker mechanism. Before writing each
+data file, the writer creates a corresponding marker to mark that the file is 
created, so that the writer can use the
+markers to automatically clean up uncommitted data in failure and rollback scenarios. We propose to use the markers to identify conflicts at the file group level while writing data. There are some subtle differences in the early conflict detection workflow between the different marker maintainers. For direct markers, Hudi lists the necessary marker files and checks for conflicts before a writer creates its marker and starts writing the corresponding data file. For timeline-server-based markers, Hudi simply fetches the result of the marker conflict check before a writer creates its marker and starts writing the corresponding data files; the conflicts are checked asynchronously and periodically so that writing conflicts can be detected as early as possible. Both writers may still write data files of the same file slice until the conflict is detected in the next round of checking.
+
+What's more, Hudi can stop writing earlier thanks to early conflict detection and release its resources back to the cluster, improving resource utilization.
+
+Note that the early conflict detection proposed by this RFC operates within OCC; any conflict detection outside the scope of OCC is not handled. For example, the current OCC for multiple writers cannot detect a conflict when two concurrent writers perform INSERT operations for the same set of record keys, because the writers write to different file groups. This RFC does not intend to address that problem.
+
+## Background
+
+As we know, transactions and multi-writers of data lakes are becoming the key 
characteristics of building Lakehouse
+these days. Quoting this inspiring blog Lakehouse Concurrency Control: 
Are we too optimistic? directly:
+https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic/
+
+> "Hudi implements a file level, log based concurrency control protocol on the 
Hudi timeline, which in-turn relies
+> on bare minimum atomic puts to cloud storage. By building on an event log as 
the central piece for inter process
+> coordination, Hudi is able to offer a few flexible deployment models that 
offer greater concurrency over pure OCC
+> approaches that just track table snapshots."
+
+In the multi-writer scenario, Hudi's existing conflict detection occurs after the writer finishes writing the data and before it commits the metadata. In other words, the writer only detects the conflict when it starts to commit, even though all computation and data writing have already completed, which wastes resources.
+
+For example:
+
+Suppose there are two writing jobs: job1 writes 10 MB of data to the Hudi table, including updates to file group 1. Another job, job2, writes 100 GB to the Hudi table and also updates the same file group 1.
+
+Job1 finishes and commits to Hudi successfully. After a few hours, job2 finishes writing its data files (100 GB) and starts to commit the metadata. Only at this point is the conflict with job1 found, and job2 has to be aborted and re-run. Obviously, a lot of computing resources and time are wasted on job2.
+
+Hudi currently has two important mechanisms, marker mechanism and heartbeat 
mechanism:
+
+1. Marker mechanism can track all the files that are part of an active write.
+2. Heartbeat mechanism that can track all active writers to a Hudi table.
+
+Based on marker and heartbeat, this RFC proposes a new conflict detection: 
Early Conflict Detection. Before the writer
+creates the marker and before it starts to write the file, Hudi performs this 
new conflict detection, trying to detect
+the writing conflict directly (for direct markers) or get the async 

[GitHub] [hudi] xushiyan commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service

2022-10-05 Thread GitBox


xushiyan commented on code in PR #4309:
URL: https://github.com/apache/hudi/pull/4309#discussion_r985348379


##
rfc/rfc-43/rfc-43.md:
##
@@ -0,0 +1,369 @@
+
+
+# RFC-43: Table Management Service for Hudi
+
+## Proposers
+
+- @yuzhaojing
+
+## Approvers
+
+- @vinothchandar
+- @Raymond
+
+## Status
+
+JIRA: 
[https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016)
+
+## Abstract
+
+A Hudi table needs table management operations. Currently, there are three ways to schedule these jobs:
+
+- Inline: run the table service and the writing job in the same application, serially.
+
+- Async: run the table service and the writing job in the same application, with the table service executing asynchronously in parallel with the write.
+
+- Independent compaction/clustering job: run the table service asynchronously in a separate application.
+
+As the number of Hudi tables increases, the lack of management capabilities drives maintenance costs up. This proposal is to implement an independent compaction/clustering service to manage Hudi compaction/clustering jobs.
+
+## Background
+
+In the current implementation, if a Hudi table needs compaction/clustering, there are only three options:
+
+1. Use inline compaction/clustering; in this mode the table service blocks the writing job.
+
+2. Use async compaction/clustering; in this mode the table service runs asynchronously but shares resources with the Hudi writing job, which may affect the stability of the write, and that is not what the user wants to see.
+
+3. Use an independent compaction/clustering job, which is a better way to schedule the service since it runs asynchronously and does not share resources with the writing job, but it still has some problems:
+   1. Users have to enable a lock service provider so that there is no data loss; in particular, while compaction/clustering is being scheduled, no other writes should proceed concurrently, hence a lock is required.
+   2. The user needs to manually start an async compaction/clustering application, which means the user has to maintain two jobs.
+   3. As the number of Hudi jobs increases, there is no unified service to manage compaction/clustering jobs (monitoring, retries, history, etc.), which makes maintenance costs increase.
+
+With this effort, we want to provide an independent compaction/clustering service with these abilities:
+
+- Provides a pluggable execution interface that can adapt to multiple execution engines, such as Spark and Flink.
+
+- Supports failover, which requires persisting the compaction/clustering messages.
+
+- Exposes comprehensive metrics to the outside, reusing HoodieMetrics.
+
+- Provides automatic retries for failed compaction/clustering jobs.
+
+## Implementation
+
+### Processing mode
+
+There are different processing modes depending on whether the metaserver is enabled:
+
+- Hudi metaserver is used
+- The pull-based mechanism only works for a small number of tables; scanning thousands of tables for possible services would induce a heavy listing load.
+- The metaserver provides a listener that takes as input the URIs of the Table Management Service and triggers a callback through a hook at each instant commit, thereby calling the Table Management Service to do the scheduling/execution for the table.
+  ![](service_with_meta_server.png)
+
+- Hudi metaserver is not used
+- For every write/commit on the table, the table management server is notified.
+- Each request to the table management server carries all pending instants that match the current action type.
+  ![](service_without_meta_server.png)
+
+### Processing flow
+
+- If the Hudi metaserver is used, after receiving the request the table management server schedules the relevant table service to the table's timeline

Review Comment:
   > schedules the relevant table service to the table's timeline
   
   need to make it explicit: this is the table timeline managed in the metaserver, right? It can be confused with the table timeline on storage. Should also mention how the metaserver interacts with storage in this case. 
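To make the commit-notification flow under discussion concrete, here is a minimal sketch of the "metaserver not used" mode, where the writer notifies the table management server on each commit. Every type and method name below is invented for illustration and is not Hudi's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative-only contract: on each commit the writer notifies the table
// management service, passing the pending instants for the current action type.
interface TableManagementService {
    void onCommit(String table, String commitInstant, List<String> pendingInstants);
}

class SchedulingTableManagementService implements TableManagementService {
    final List<String> scheduled = new ArrayList<>();

    @Override
    public void onCommit(String table, String commitInstant, List<String> pendingInstants) {
        // Schedule a table service run for every pending instant reported by the writer.
        for (String instant : pendingInstants) {
            scheduled.add(table + "@" + instant);
        }
    }
}

public class TableServiceNotificationDemo {
    public static void main(String[] args) {
        SchedulingTableManagementService svc = new SchedulingTableManagementService();
        // A writer finishing commit 20221005120000 reports one pending compaction instant.
        svc.onCommit("orders", "20221005120000", Arrays.asList("20221005110000"));
        System.out.println(svc.scheduled); // [orders@20221005110000]
    }
}
```

In the metaserver mode the same callback would be triggered by the metaserver's commit hook instead of by the writer directly.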



##
rfc/rfc-43/rfc-43.md:
##
@@ -0,0 +1,316 @@
+
+
+# RFC-43: Table Management Service for Hudi
+
+## Proposers
+
+- @yuzhaojing
+
+## Approvers
+
+- @vinothchandar
+- @Raymond
+
+## Status
+
+JIRA: 
[https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016)
+
+## Abstract
+
+A Hudi table needs table management operations. Currently, there are three ways to schedule these jobs:
+
+- Inline: run the table service and the writing job in the same application, serially.
+
+- Async: run the table service and the writing job in the same application, with the table service executing asynchronously in parallel with the write.
+
+- Independent compaction/clustering job: run the table service asynchronously in a separate application.

[GitHub] [hudi] yesemsanthoshkumar commented on a diff in pull request #6726: [HUDI-4630] Add transformer capability to individual feeds in MultiTableDeltaStreamer

2022-10-05 Thread GitBox


yesemsanthoshkumar commented on code in PR #6726:
URL: https://github.com/apache/hudi/pull/6726#discussion_r985346731


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java:
##
@@ -135,6 +135,7 @@ private void 
populateTableExecutionContextList(TypedProperties properties, Strin
   if (cfg.enableMetaSync && 
StringUtils.isNullOrEmpty(tableProperties.getString(HoodieSyncConfig.META_SYNC_TABLE_NAME.key(),
 ""))) {
 throw new HoodieException("Meta sync table field not provided!");
   }
+  populateTransformerProps(cfg, tableProperties);

Review Comment:
   @yihua Sure. I'm new to this. I'll work over this weekend.
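The per-table transformer override this PR adds can be sketched as follows. The property key and helper below are illustrative only, not the actual MultiTableDeltaStreamer configuration surface:

```java
import java.util.Properties;

// Sketch: a table-level property, when present, overrides the global transformer setting.
public class TransformerPropsDemo {
    // Illustrative key, not necessarily the exact Hudi config name.
    static final String TRANSFORMER_KEY = "hoodie.deltastreamer.transformer.class";

    static String resolveTransformer(Properties global, Properties tableProps) {
        // A table-specific value wins; otherwise fall back to the global one.
        String tableValue = tableProps.getProperty(TRANSFORMER_KEY);
        return tableValue != null ? tableValue : global.getProperty(TRANSFORMER_KEY);
    }

    public static void main(String[] args) {
        Properties global = new Properties();
        global.setProperty(TRANSFORMER_KEY, "com.example.GlobalTransformer");

        Properties tableA = new Properties(); // no override -> inherits the global value
        Properties tableB = new Properties();
        tableB.setProperty(TRANSFORMER_KEY, "com.example.TableBTransformer");

        System.out.println(resolveTransformer(global, tableA)); // com.example.GlobalTransformer
        System.out.println(resolveTransformer(global, tableB)); // com.example.TableBTransformer
    }
}
```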






[GitHub] [hudi] zhangyue19921010 commented on pull request #6003: [HUDI-1575][RFC-56] Early Conflict Detection For Multi-writer

2022-10-05 Thread GitBox


zhangyue19921010 commented on PR #6003:
URL: https://github.com/apache/hudi/pull/6003#issuecomment-1264887516

   Hi @yihua and @pratyakshsharma. Really appreciate your attention here! Addressed the comments. PTAL :)
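For context, the direct-marker early conflict check that this RFC proposes (quoted earlier in this thread) can be sketched as below. All class and method names are invented for illustration; Hudi's actual marker APIs differ:

```java
import java.util.*;

// Hedged sketch of direct-marker early conflict detection: before a writer
// creates its marker for a file group, it checks whether another active writer
// already holds a marker for the same file group and aborts early if so.
public class MarkerBasedConflictDetector {
    // Markers already created by active writers, keyed by writer id.
    private final Map<String, Set<String>> markersByWriter = new HashMap<>();

    public void registerMarker(String writerId, String fileGroupId) {
        markersByWriter.computeIfAbsent(writerId, k -> new HashSet<>()).add(fileGroupId);
    }

    /** Returns true if another writer already holds a marker for this file group. */
    public boolean hasConflict(String writerId, String fileGroupId) {
        return markersByWriter.entrySet().stream()
                .filter(e -> !e.getKey().equals(writerId))
                .anyMatch(e -> e.getValue().contains(fileGroupId));
    }

    public static void main(String[] args) {
        MarkerBasedConflictDetector d = new MarkerBasedConflictDetector();
        d.registerMarker("writer-1", "fg-1");                  // writer-1 starts writing file group 1
        System.out.println(d.hasConflict("writer-2", "fg-1")); // true  -> abort writer-2 early
        System.out.println(d.hasConflict("writer-2", "fg-2")); // false -> safe to write
    }
}
```

For timeline-server-based markers the same check would run asynchronously and periodically on the server side, as the RFC describes.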





[GitHub] [hudi] dragonH commented on issue #6832: [SUPPORT] AWS Glue 3.0 fail to write dataset with hudi (hive sync issue)

2022-10-05 Thread GitBox


dragonH commented on issue #6832:
URL: https://github.com/apache/hudi/issues/6832#issuecomment-1264882911

   hi @codope 
   
   sure, will also do the latest Hudi testing with EMR and share the result here
   
   thanks for the help
   
   hi @kazdy 
   
   thanks for the help
   
   i acknowledge the behavior of AWS Glue converting the table name and column names to lowercase
   
   but was surprised that this caused the issue 
   
   after converting the table name to lowercase
   
   the data was written successfully
   
   https://user-images.githubusercontent.com/18332044/193495139-64ddc70d-1468-49c8-8772-ba7af43e80dc.png
   
   just curious about the steps of how Hudi created and synced the table
   
   because we can see the table was created with the lowercase name
   
   how come it used the original name (with upper case) to find and compare the partitions?





[GitHub] [hudi] hudi-bot commented on pull request #6854: [HUDI-4631] Adding retries to spark datasource writes on conflict failures:

2022-10-05 Thread GitBox


hudi-bot commented on PR #6854:
URL: https://github.com/apache/hudi/pull/6854#issuecomment-1264756277

   
   ## CI report:
   
   * 3fd99e92b8be748fa52e025f8bc6bbf6681df359 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios

2022-10-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4631:
-
Labels: pull-request-available  (was: )

> Enhance retries for failed writes w/ write conflicts in a multi writer 
> scenarios
> 
>
> Key: HUDI-4631
> URL: https://issues.apache.org/jira/browse/HUDI-4631
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Let's say there are two writers from t0 to t5, and Hudi fails w2 while w1 succeeds. The user restarts w2, and for the next 5 minutes there are no other overlapping writers, so the same write from w2 will now succeed. So whenever there is a write conflict and a pipeline fails, all the user needs to do is restart the pipeline or retry ingesting the same batch.
>  
> Ask: can we add retries within Hudi for such failures? In most cases, users just restart the pipeline in such cases anyway.
>  





[GitHub] [hudi] hudi-bot commented on pull request #6854: [HUDI-4631] Adding retries to spark datasource writes on conflict failures:

2022-10-05 Thread GitBox


hudi-bot commented on PR #6854:
URL: https://github.com/apache/hudi/pull/6854#issuecomment-1264730004

   
   ## CI report:
   
   * 3fd99e92b8be748fa52e025f8bc6bbf6681df359 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6854: [HUDI-4631] Adding retries to spark datasource writes on conflict failures:

2022-10-05 Thread GitBox


hudi-bot commented on PR #6854:
URL: https://github.com/apache/hudi/pull/6854#issuecomment-1264729199

   
   ## CI report:
   
   * 3fd99e92b8be748fa52e025f8bc6bbf6681df359 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on pull request #6851: [HUDI-4966] Add a partition extractor to handle partition values with slashes

2022-10-05 Thread GitBox


yihua commented on PR #6851:
URL: https://github.com/apache/hudi/pull/6851#issuecomment-1264718033

   > @yihua thanks for looking into this. I think the user's problem can also be resolved by using `SlashEncodedDayPartitionValueExtractor`? probably need to follow the [migration guide](https://hudi.apache.org/releases/release-0.12.0#configuration-updates)
   
   If the date output format is “yyyy/MM/dd”, yes.  But we should also allow the user to specify any format, e.g., “yyyy/MM/dd/HH” or “MM/dd/yyyy”, which don’t have any corresponding partition extractor that works.  The new partition extractor addresses this problem.  Even for “yyyy/MM/dd”, there is no need to specify an extractor after the fix.
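The splitting behavior such an extractor needs can be sketched as follows. This is a standalone illustration of the idea only, not Hudi's actual `PartitionValueExtractor` API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: treat every slash-separated path segment as one partition value,
// regardless of the date format that produced the path.
public class SlashPartitionValues {
    /** "2022/10/05" -> [2022, 10, 05]; works the same for any segment order. */
    public static List<String> extract(String partitionPath) {
        List<String> values = new ArrayList<>();
        for (String part : partitionPath.split("/")) {
            if (!part.isEmpty()) {
                values.add(part);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("2022/10/05")); // [2022, 10, 05]
        System.out.println(extract("10/05/2022")); // [10, 05, 2022]
    }
}
```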





[GitHub] [hudi] nsivabalan opened a new pull request, #6854: [HUDI-4631] Adding retries to spark datasource writes on conflict failures:

2022-10-05 Thread GitBox


nsivabalan opened a new pull request, #6854:
URL: https://github.com/apache/hudi/pull/6854

   ### Change Logs
   
   With Hudi's OCC, one of the commits is expected to fail if there are 
overlapping writes. From a user's standpoint, it is very likely the user would 
simply retry the failed write without any additional action. So this adds retry 
functionality to Spark datasource writes with Hudi, automatically retrying in 
case of conflict failures. 
   
   ### Impact
   
   User experience with multi-writers will be improved by these automatic 
retries. 
   
   **Risk level: medium**
   
   Users should enable retries with caution, since Hudi could keep retrying the 
failed commit until the max retries are exhausted, which could incur additional 
compute cost for large batches. 
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - Configs introduced 
  - `hoodie.write.lock.retry.on.conflict.failures` : to enable retries on 
conflict failures. Default is false. 
  - `hoodie.write.lock.num.retries.on.conflict.failures` : max number of 
times to retry on conflict failures.
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
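
   The retry behavior governed by the two configs above can be sketched 
generically as follows. This is a hypothetical illustration with made-up helper 
names (`ConflictException`, `write_with_retries`), not the actual Hudi 
implementation:

```python
class ConflictException(Exception):
    """Raised when an OCC write loses to a concurrent, overlapping commit."""

def write_with_retries(write_fn, retry_on_conflict=False, max_retries=3):
    """Invoke write_fn, retrying only on conflict failures.

    retry_on_conflict mirrors hoodie.write.lock.retry.on.conflict.failures
    (default false); max_retries mirrors
    hoodie.write.lock.num.retries.on.conflict.failures.
    Non-conflict failures propagate immediately.
    """
    attempts = 0
    while True:
        try:
            return write_fn()
        except ConflictException:
            # Re-raise when retries are disabled or the budget is exhausted.
            if not retry_on_conflict or attempts >= max_retries:
                raise
            attempts += 1
```

   Note that with retries disabled (the default), a conflict failure surfaces 
to the caller on the first attempt, preserving the current behavior.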
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6852: [MINOR] Fix testUpdateRejectForClustering

2022-10-05 Thread GitBox


hudi-bot commented on PR #6852:
URL: https://github.com/apache/hudi/pull/6852#issuecomment-1264701402

   
   ## CI report:
   
   * 8a6ef11a8c573fa3cd49217157a6a8bb7f112395 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11965)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   




