[GitHub] [hudi] 7c00 opened a new pull request #4563: [HUDI-3211][RFC-44] Add RFC for Hudi Connector for Presto

2022-01-10 Thread GitBox


7c00 opened a new pull request #4563:
URL: https://github.com/apache/hudi/pull/4563


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] leesf commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


leesf commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009684128


   > @leesf yes we won't have conflicting patches to pick from master. we can 
land this one now.
   
   @xushiyan CI passed, so can we merge?






[GitHub] [hudi] hudi-bot removed a comment on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009641031


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b6a98a3a1cdfc690105ca026788bb9d24393b4c2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5057)
 
   * 3ca5b126e192706635056511d4acd5f7b01f6ea0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009673555


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * 3ca5b126e192706635056511d4acd5f7b01f6ea0 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4562: [HUDI-3211][RFC-44] Claim RFC number for RFC for Hudi Connector for Presto

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4562:
URL: https://github.com/apache/hudi/pull/4562#issuecomment-1009663658


   
   ## CI report:
   
   * db689b8aec1a61f6d034ba37a471c832fbe1d6f0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4562: [HUDI-3211][RFC-44] Claim RFC number for RFC for Hudi Connector for Presto

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4562:
URL: https://github.com/apache/hudi/pull/4562#issuecomment-1009661808


   
   ## CI report:
   
   * db689b8aec1a61f6d034ba37a471c832fbe1d6f0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4562: [HUDI-3211][RFC-44] Claim RFC number for RFC for Hudi Connector for Presto

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4562:
URL: https://github.com/apache/hudi/pull/4562#issuecomment-1009661808


   
   ## CI report:
   
   * db689b8aec1a61f6d034ba37a471c832fbe1d6f0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Updated] (HUDI-3211) RFC for Presto Hudi connector

2022-01-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3211:
-
Labels: pull-request-available  (was: )

>  RFC for Presto Hudi connector
> --
>
> Key: HUDI-3211
> URL: https://issues.apache.org/jira/browse/HUDI-3211
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Todd Gao
>Priority: Major
>  Labels: pull-request-available
>






[GitHub] [hudi] 7c00 opened a new pull request #4562: [HUDI-3211] Claim RFC number for RFC for Hudi Connector for Presto

2022-01-10 Thread GitBox


7c00 opened a new pull request #4562:
URL: https://github.com/apache/hudi/pull/4562


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] danny0405 commented on a change in pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros

2022-01-10 Thread GitBox


danny0405 commented on a change in pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#discussion_r781801961



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/table/format/cow/Int64TimestampColumnReader.java
##
@@ -75,25 +90,29 @@ protected void readBatchFromDictionaryIds(
 for (int i = rowId; i < rowId + num; ++i) {
   if (!column.isNullAt(i)) {
 column.setTimestamp(i, decodeInt64ToTimestamp(
-utcTimestamp, dictionary, dictionaryIds.getInt(i)));
+utcTimestamp, dictionary, dictionaryIds.getInt(i), chronoUnit));
   }
 }
   }
 
   public static TimestampData decodeInt64ToTimestamp(
   boolean utcTimestamp,
   org.apache.parquet.column.Dictionary dictionary,
-  int id) {
+  int id,
+  ChronoUnit unit) {
 long value = dictionary.decodeToLong(id);
-return int64ToTimestamp(utcTimestamp, value);
+return int64ToTimestamp(utcTimestamp, value, unit);
   }
 
-  private static TimestampData int64ToTimestamp(boolean utcTimestamp, long 
millionsOfDay) {
+  private static TimestampData int64ToTimestamp(
+  boolean utcTimestamp,
+  long millionsOfDay,
+  ChronoUnit unit) {

Review comment:
   millionsOfDay => interval
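
   To make the unit handling under review concrete, here is a minimal
   standalone sketch (illustrative only; `int64ToInstant` and its signature are
   assumptions, not Hudi's actual API) of decoding a raw int64 parquet value
   into a timestamp when its unit (MILLIS for timestamp-millis, MICROS for
   timestamp-micros) is supplied as a `ChronoUnit`:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class Int64TimestampSketch {

  // Interpret a raw int64 value read from parquet according to the given unit.
  static Instant int64ToInstant(long interval, ChronoUnit unit) {
    switch (unit) {
      case MILLIS:
        return Instant.ofEpochMilli(interval);
      case MICROS:
        // Split into whole seconds plus leftover microseconds to avoid
        // overflow when converting to nanoseconds.
        long seconds = Math.floorDiv(interval, 1_000_000L);
        long micros = Math.floorMod(interval, 1_000_000L);
        return Instant.ofEpochSecond(seconds, micros * 1_000L);
      default:
        throw new IllegalArgumentException("Unsupported unit: " + unit);
    }
  }

  public static void main(String[] args) {
    // Both print 2022-01-10T00:00:00Z.
    System.out.println(int64ToInstant(1_641_772_800_000L, ChronoUnit.MILLIS));
    System.out.println(int64ToInstant(1_641_772_800_000_000L, ChronoUnit.MICROS));
  }
}
```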








[jira] [Updated] (HUDI-3211) RFC for Presto Hudi connector

2022-01-10 Thread Todd Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Gao updated HUDI-3211:
---
Summary:  RFC for Presto Hudi connector  (was: Claim RFC number for RFC for 
Presto Hudi connector)

>  RFC for Presto Hudi connector
> --
>
> Key: HUDI-3211
> URL: https://issues.apache.org/jira/browse/HUDI-3211
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Todd Gao
>Priority: Major
>






[jira] [Created] (HUDI-3211) Claim RFC number for RFC for Presto Hudi connector

2022-01-10 Thread Todd Gao (Jira)
Todd Gao created HUDI-3211:
--

 Summary: Claim RFC number for RFC for Presto Hudi connector
 Key: HUDI-3211
 URL: https://issues.apache.org/jira/browse/HUDI-3211
 Project: Apache Hudi
  Issue Type: Task
Reporter: Todd Gao








[GitHub] [hudi] hudi-bot commented on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1009643704


   
   ## CI report:
   
   * ed10ca789674ce963ab430d0656b1e05d3bb6a70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5052)
 
   * 83b5d6d0346dbb0d11796f2e70d76f6f92d64632 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5092)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1009642387


   
   ## CI report:
   
   * ed10ca789674ce963ab430d0656b1e05d3bb6a70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5052)
 
   * 83b5d6d0346dbb0d11796f2e70d76f6f92d64632 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009620230


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * 08c493bdeb441e150177fdaf3cbc5e4bcfabc42a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4910)
 
   * e47f03caabfd7af8588cce120989e0370111ae62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5089)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009643659


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * e47f03caabfd7af8588cce120989e0370111ae62 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5089)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Created] (HUDI-3210) [UMBRELLA] A new Presto connector for Hudi

2022-01-10 Thread Todd Gao (Jira)
Todd Gao created HUDI-3210:
--

 Summary: [UMBRELLA] A new Presto connector for Hudi
 Key: HUDI-3210
 URL: https://issues.apache.org/jira/browse/HUDI-3210
 Project: Apache Hudi
  Issue Type: Epic
  Components: Presto Integration
Reporter: Todd Gao


This JIRA tracks all the tasks related to building a new Hudi connector in 
Presto.





[GitHub] [hudi] hudi-bot commented on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1009642387


   
   ## CI report:
   
   * ed10ca789674ce963ab430d0656b1e05d3bb6a70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5052)
 
   * 83b5d6d0346dbb0d11796f2e70d76f6f92d64632 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4561: handleEndInputEvent is executed synchronously

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4561:
URL: https://github.com/apache/hudi/pull/4561#issuecomment-1009641110


   
   ## CI report:
   
   * b4202962c8519b921ae11ba0a5481ee0887332f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4561: handleEndInputEvent is executed synchronously

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4561:
URL: https://github.com/apache/hudi/pull/4561#issuecomment-1009642427


   
   ## CI report:
   
   * b4202962c8519b921ae11ba0a5481ee0887332f1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4548: [HUDI-3184] hudi-flink support timestamp-micros

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4548:
URL: https://github.com/apache/hudi/pull/4548#issuecomment-1008698102


   
   ## CI report:
   
   * ed10ca789674ce963ab430d0656b1e05d3bb6a70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5052)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4561: handleEndInputEvent is executed synchronously

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4561:
URL: https://github.com/apache/hudi/pull/4561#issuecomment-1009641110


   
   ## CI report:
   
   * b4202962c8519b921ae11ba0a5481ee0887332f1 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009633106


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b6a98a3a1cdfc690105ca026788bb9d24393b4c2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5057)
 
   * 3ca5b126e192706635056511d4acd5f7b01f6ea0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009641031


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b6a98a3a1cdfc690105ca026788bb9d24393b4c2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5057)
 
   * 3ca5b126e192706635056511d4acd5f7b01f6ea0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] todd5167 opened a new pull request #4561: handleEndInputEvent is executed synchronously

2022-01-10 Thread GitBox


todd5167 opened a new pull request #4561:
URL: https://github.com/apache/hudi/pull/4561


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   fix: https://github.com/apache/hudi/issues/4469.
   
   handleEndInputEvent is executed synchronously; asynchronous execution can 
cause the Flink cluster to shut down early before event processing is complete 
(see the sketch below).
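
   As a rough illustration of the fix (a hedged sketch with hypothetical 
names, not the actual StreamWriteOperatorCoordinator code): the end-of-input 
event is handled on the caller's thread instead of being submitted to an 
executor, so shutdown cannot race with the final commit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CoordinatorSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  void handleEventFromOperator(Runnable commitAction, boolean isEndInput) {
    if (isEndInput) {
      // Synchronous: the final commit completes before this method returns,
      // so the cluster cannot tear down while it is still in flight.
      commitAction.run();
    } else {
      // Ordinary events may still be handled asynchronously.
      executor.execute(commitAction);
    }
  }
}
```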
   
   ## Brief change log
 *Modify:  
org.apache.hudi.sink.StreamWriteOperatorCoordinator#handleEventFromOperator*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1008858401


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b6a98a3a1cdfc690105ca026788bb9d24393b4c2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5057)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009633106


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b6a98a3a1cdfc690105ca026788bb9d24393b4c2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5057)
 
   * 3ca5b126e192706635056511d4acd5f7b01f6ea0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Assigned] (HUDI-1629) Change partitioner abstraction to implement multiple strategies

2022-01-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-1629:
---

Assignee: Ethan Guo  (was: Thirumalai Raj R)

> Change partitioner abstraction to implement multiple strategies
> ---
>
> Key: HUDI-1629
> URL: https://issues.apache.org/jira/browse/HUDI-1629
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> The existing UpsertPartitioner only considers file sizing when assigning 
> inserts/updates. We also want to consider data locality and other factors, so 
> change the partitioner abstraction to make it easy to implement and plug in 
> other strategies.
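
To make the proposed abstraction concrete, here is a hedged sketch 
(hypothetical interface and class names, not Hudi's actual code) of a 
pluggable bucket-assignment strategy, with the existing file-sizing behavior 
as one implementation:

```java
import java.util.List;

// Each strategy decides which file group ("bucket") a record lands in.
interface BucketAssignmentStrategy {
  int assignBucket(String recordKey, List<Integer> candidateBuckets);
}

// File-sizing-style strategy: always fill the first candidate,
// e.g. the file group with the most remaining capacity.
class FileSizingStrategy implements BucketAssignmentStrategy {
  @Override
  public int assignBucket(String recordKey, List<Integer> candidateBuckets) {
    return candidateBuckets.get(0);
  }
}
```

Other strategies (for example locality-aware ones) would then plug in behind 
the same interface without touching the write path.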





[jira] [Assigned] (HUDI-1631) Sort data when creating a new version of file

2022-01-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-1631:
---

Assignee: Ethan Guo  (was: Thirumalai Raj R)

> Sort data when creating a new version of file
> -
>
> Key: HUDI-1631
> URL: https://issues.apache.org/jira/browse/HUDI-1631
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Add an option to sort data by specified column(s) when creating a new version 
> of a file. Anytime we open a file and write data (whether to apply updates or 
> add new records), we want to sort the data in the file by the specified 
> column(s).
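
A hedged sketch of the idea (toy types, not Hudi's write path): sort the rows 
destined for the new file version by the specified column before writing them 
out.

```java
import java.util.Comparator;
import java.util.List;

class SortOnWriteSketch {
  // A row about to be written into the new file version; `sortColumn` stands
  // in for the user-specified column(s).
  record Row(String key, String sortColumn) {}

  // Sort the rows for the new file version before they are written out.
  static void sortBeforeWrite(List<Row> rows) {
    rows.sort(Comparator.comparing(Row::sortColumn));
  }
}
```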





[jira] [Assigned] (HUDI-1630) Add partitioner strategy for improving data locality

2022-01-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-1630:
---

Assignee: Ethan Guo  (was: Thirumalai Raj R)

> Add partitioner strategy for improving data locality
> 
>
> Key: HUDI-1630
> URL: https://issues.apache.org/jira/browse/HUDI-1630
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> We can use index/metadata information to co-locate records with the same 
> value for the specified column(s) in the same fileId.
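
A toy illustration of the colocation idea (hypothetical names, not Hudi's 
actual partitioner): route records with the same value of the chosen column to 
the same fileId, so queries filtering on that column touch fewer file groups.

```java
import java.util.List;

class ColocationSketch {
  // Deterministically map a column value to one of the existing fileIds.
  static String assignFileId(String columnValue, List<String> fileIds) {
    return fileIds.get(Math.floorMod(columnValue.hashCode(), fileIds.size()));
  }
}
```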





[GitHub] [hudi] dongkelun edited a comment on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


dongkelun edited a comment on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009623456


   @nsivabalan @codope 
   I found a previous issue, 
[HUDI-1739](https://issues.apache.org/jira/browse/HUDI-1739), in 
`MetadataConversionUtils.getRequestedReplaceMetadata` that is the same as this 
issue.
   
   [#2784](https://github.com/apache/hudi/pull/2784)






[GitHub] [hudi] dongkelun commented on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


dongkelun commented on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009623456


   @nsivabalan @codope 
   I found a previous issue, 
[HUDI-1739](https://issues.apache.org/jira/browse/HUDI-1739), in 
`MetadataConversionUtils.getRequestedReplaceMetadata` that is the same as this 
issue.






[jira] [Resolved] (HUDI-26) Introduce a way to collapse filegroups into one and reindex #491

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-26.


> Introduce a way to collapse filegroups into one and reindex #491
> 
>
> Key: HUDI-26
> URL: https://issues.apache.org/jira/browse/HUDI-26
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>
> https://github.com/uber/hudi/issues/491





[jira] [Updated] (HUDI-10) Auto tune bulk insert parallelism #555

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-10?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-10:
---
Priority: Blocker  (was: Major)

> Auto tune bulk insert parallelism #555
> --
>
> Key: HUDI-10
> URL: https://issues.apache.org/jira/browse/HUDI-10
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/555





[jira] [Updated] (HUDI-10) Auto tune bulk insert parallelism #555

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-10?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-10:
---
Fix Version/s: 0.11.0

> Auto tune bulk insert parallelism #555
> --
>
> Key: HUDI-10
> URL: https://issues.apache.org/jira/browse/HUDI-10
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/555





[jira] [Assigned] (HUDI-10) Auto tune bulk insert parallelism #555

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-10?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-10:
--

Assignee: Ethan Guo

> Auto tune bulk insert parallelism #555
> --
>
> Key: HUDI-10
> URL: https://issues.apache.org/jira/browse/HUDI-10
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Major
>
> https://github.com/uber/hudi/issues/555





[jira] [Updated] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-64:
---
Priority: Blocker  (was: Major)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.12.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives you the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still producing a parquet file of the configured size during 
> compaction.
>  # WorkloadProfile: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
>
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate ROI before actually implementing.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].
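
A toy sketch of the idea in point 1 (hypothetical types, not Hudi's API): 
derive an average record size from prior commit metadata and use it to decide 
when a file has reached its configured target size.

```java
import java.util.List;

class RecordSizeEstimator {
  // Each entry is {totalBytesWritten, totalRecordsWritten} from one prior
  // commit's metadata; fall back to a configured default with no history.
  static long avgRecordSizeBytes(List<long[]> commitStats, long fallbackBytes) {
    long bytes = 0, records = 0;
    for (long[] c : commitStats) {
      bytes += c[0];
      records += c[1];
    }
    return records == 0 ? fallbackBytes : bytes / records;
  }

  public static void main(String[] args) {
    long avg = avgRecordSizeBytes(List.of(new long[] {120_000_000L, 100_000L}), 1024L);
    long targetFileBytes = 120L * 1024 * 1024;
    // Roll over to a new file after roughly this many records.
    System.out.println(avg + " bytes/record -> ~" + (targetFileBytes / avg) + " records/file");
  }
}
```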





[jira] [Commented] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472469#comment-17472469
 ] 

Vinoth Chandar commented on HUDI-64:


[~guoyihua] Assigning to you to triage this again and see if it's relevant.

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives you the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still producing a parquet file of the configured size during 
> compaction.
>  # WorkloadProfile: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
>
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate ROI before actually implementing.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].





[jira] [Updated] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-64:
---
Fix Version/s: 0.11.0
   (was: 0.12.0)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives you the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still producing a parquet file of the configured size during 
> compaction.
>  # WorkloadProfile: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
>
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate ROI before actually implementing.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].





[jira] [Updated] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-64:
---
Fix Version/s: 0.12.0

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
> Fix For: 0.12.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives you the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still producing a parquet file of the configured size during 
> compaction.
>  # WorkloadProfile: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
>
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate ROI before actually implementing.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].





[jira] [Assigned] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-64:
--

Assignee: Ethan Guo  (was: Vinoth Chandar)

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.12.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives you the size and number of records in 
> each file).
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow while still producing a parquet file of the configured size during 
> compaction.
>  # WorkloadProfile: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs. updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible.
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized (default: 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71]).
>
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate ROI before actually implementing.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].





[jira] [Updated] (HUDI-1127) Handling late arriving Deletes

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1127:
-
Priority: Blocker  (was: Major)

> Handling late arriving Deletes
> --
>
> Key: HUDI-1127
> URL: https://issues.apache.org/jira/browse/HUDI-1127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Writer Core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Recently I was working on a [PR|https://github.com/apache/hudi/pull/1704] to 
> enhance the OverwriteWithLatestAvroPayload class to consider records in 
> storage when merging. Briefly, this class will ignore older updates if the 
> record in storage is the latest one (based on the precombine field).
> Based on this, the expectation is that any write operation is dealt with the 
> same way: if it is older, it should be ignored.
> While at this, I identified that we cannot handle all deletes the same way, 
> because we process deletes in two main ways:
>  * by adding and enabling a metadata field `_hoodie_is_deleted` in the 
> original record and sending it as an UPSERT operation;
>  * by using an empty payload (EmptyHoodieRecordPayload) and sending the 
> write as a DELETE operation.
> While the former has an ordering field and can be processed as expected 
> (older deletes will be ignored), the latter has no ordering field to tell 
> whether it is an older delete, and hence will let the older delete go 
> through.
> Just opening this issue to track this gap. We need to identify the right 
> choice here and fix as needed.
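
A minimal sketch of why the two delete paths differ (hypothetical method, not 
Hudi's merge code): the `_hoodie_is_deleted` path carries an ordering value to 
compare against storage, while the empty-payload path does not.

```java
class LateDeleteSketch {
  // incomingOrdering == null models the EmptyHoodieRecordPayload path, which
  // carries no ordering value; non-null models the `_hoodie_is_deleted` path.
  static boolean applyDelete(Long incomingOrdering, long storedOrdering) {
    if (incomingOrdering == null) {
      // No ordering field: a late (older) delete cannot be detected and is
      // applied unconditionally; this is the gap the issue tracks.
      return true;
    }
    // Ordering field present: older deletes are ignored, as with updates.
    return incomingOrdering >= storedOrdering;
  }
}
```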





[jira] [Assigned] (HUDI-1127) Handling late arriving Deletes

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1127:


Assignee: Alexey Kudinkin  (was: Bhavani Sudha)

> Handling late arriving Deletes
> --
>
> Key: HUDI-1127
> URL: https://issues.apache.org/jira/browse/HUDI-1127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, Writer Core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Recently I was working on a [PR|https://github.com/apache/hudi/pull/1704] to
> enhance the OverwriteWithLatestAvroPayload class to consider records in
> storage when merging. Briefly, this class will ignore older updates if the
> record in storage is the latest one (based on the precombine field).
> Based on this, the expectation is that any write operation is dealt with the
> same way: if it is older, it should be ignored.
> While at this, I identified that we cannot handle all deletes the same way.
> This is because we process deletes mainly in two ways:
>  * by adding and enabling a metadata field `_hoodie_is_deleted` in the
> original record and sending it as an UPSERT operation.
>  * by using an empty payload via the EmptyHoodieRecordPayload and sending
> the write as a DELETE operation.
> While the former has an ordering field and can be processed as expected
> (older deletes will be ignored), the latter does not have any ordering field
> to identify whether it is an older delete, and hence will let the older
> delete go through.
> Just opening this issue to track this gap. We would need to identify the
> right choice here and fix as needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-10 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472466#comment-17472466
 ] 

Harsha Teja Kanna commented on HUDI-2909:
-

Hi, Thanks,

I will recreate the table. No problem.

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has the time-based keygen config shown below:
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with the exception:
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  
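
For illustration, one defensive normalization that would sidestep the exception above, sketched outside the real key-generator plumbing and assuming the SCALAR/MICROSECONDS config shown earlier; all names here are hypothetical:

{code:java}
import java.sql.Timestamp;

// Sketch only: normalize a logical-timestamp partition value to the scalar
// unit the key generator expects, before parsing the partition path.
final class PartitionValueNormalizer {
  static Object normalize(Object partitionVal) {
    if (partitionVal instanceof Timestamp) {
      // java.sql.Timestamp carries millis; the SCALAR config above expects
      // MICROSECONDS, so scale by 1000.
      return ((Timestamp) partitionVal).getTime() * 1000L;
    }
    return partitionVal;
  }
}
{code}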



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-1698) Multiwriting for Flink / Java

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1698:


Assignee: Ethan Guo  (was: Nishith Agarwal)

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1698:
-
Fix Version/s: 0.11.0

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1698:
-
Priority: Blocker  (was: Major)

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1576) Add ability to perform archival synchronously

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1576:
-
Fix Version/s: 0.11.0

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.11.0
>
>
> Currently, archival runs inline. We want to move archival to a table service
> like cleaning, compaction, etc., and treat it like one. Of course, no new
> action will be introduced.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1576) Add ability to perform archival synchronously

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1576:
-
Description: 
Currently, archival runs inline. We want to move archival to a table service
like cleaning, compaction, etc., and treat it like one. Of course, no new
action will be introduced.

 

  was:Currently, archival runs inline. We want to move archival to an async
table service like cleaning, compaction, etc.


> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Priority: Major
>
> Currently, archival runs inline. We want to move archival to a table service
> like cleaning, compaction, etc., and treat it like one. Of course, no new
> action will be introduced.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1576) Add ability to perform archival synchronously

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1576:
-
Priority: Blocker  (was: Major)

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, archival runs inline. We want to move archival to a table service
> like cleaning, compaction, etc., and treat it like one. Of course, no new
> action will be introduced.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-1576) Add ability to perform archival synchronously

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1576:


Assignee: Ethan Guo

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Major
>
> Currently, archival runs inline. We want to move archival to a table service
> like cleaning, compaction, etc., and treat it like one. Of course, no new
> action will be introduced.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] dongkelun commented on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


dongkelun commented on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009620926


   > @dongkelun Rather than introduce a behavior change, I would prefer to 
remove the warn log itself. Is there any reason we still want to keep it? Also, 
while you're at it, can you make the method `getRequestedReplaceMetadata` 
private? I don't see it being used anywhere other than the same class.
   
   OK, I have made the method `getRequestedReplaceMetadata` private in the 
classes `MetadataConversionUtils` and `ClusteringUtils`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009620230


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * 08c493bdeb441e150177fdaf3cbc5e4bcfabc42a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4910)
 
   * e47f03caabfd7af8588cce120989e0370111ae62 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5089)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009605269


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * 08c493bdeb441e150177fdaf3cbc5e4bcfabc42a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4910)
 
   * e47f03caabfd7af8588cce120989e0370111ae62 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Priority: Blocker  (was: Critical)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> As of now, in the upsert path:
>  * hudi builds a WorkloadProfile to understand total inserts and updates
> (with location info)
>  * following which, small-file info is populated
>  * then buckets are populated with the above info
>  * these buckets are later used when getPartition(Object key) is invoked in
> UpsertPartitioner.
> In step 1, to build the global workload profile, we have to do an action on
> the entire JavaRDD in the driver, and hudi saves the workload profile
> as well.
> For large write-intensive batch jobs (COW tables), caching this incurs
> additional overhead. So this effort is trying to see if we can avoid doing
> this by some means.
>  
>  
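
A rough sketch of the idea under discussion, assuming a record type that exposes its partition path and whether it already has a file location; countByKey ships only the counts back to the driver, so the input records need not be cached. All names are illustrative:

{code:java}
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

import java.util.Map;

// Sketch only: derive (partitionPath, isUpdate) -> count in one aggregation
// instead of caching the whole input RDD for profiling.
final class LightweightProfile {

  interface HoodieRecordLike {
    String partitionPath();
    boolean hasLocation(); // true => update, false => insert
  }

  static Map<Tuple2<String, Boolean>, Long> profile(JavaRDD<HoodieRecordLike> records) {
    return records
        .mapToPair(r -> new Tuple2<>(new Tuple2<>(r.partitionPath(), r.hasLocation()), 1L))
        .countByKey();
  }
}
{code}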



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1629) Change partitioner abstraction to implement multiple strategies

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1629:
-
Fix Version/s: 0.11.0

> Change partitioner abstraction to implement multiple strategies
> ---
>
> Key: HUDI-1629
> URL: https://issues.apache.org/jira/browse/HUDI-1629
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
> Fix For: 0.11.0
>
>
> The existing UpsertPartitioner only considers file sizing to assign
> inserts/updates. We also want to consider data locality and other factors, so
> change the partitioner abstraction to make it easy to implement and plug in
> other strategies.
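
A sketch of what such an abstraction could look like, with illustrative names only (these are not actual Hudi interfaces). The existing sizing logic and any new locality-aware policy would then just be different implementations behind the same interface:

{code:java}
// Sketch only: a pluggable strategy that decides which bucket/file group a
// record lands in.
interface BucketAssignmentStrategy {
  int assignBucket(String recordKey, String partitionPath);
}

// File-sizing flavor, akin to what UpsertPartitioner does today: fill the
// currently smallest file group first.
final class SmallFileFirstStrategy implements BucketAssignmentStrategy {
  private final long[] bucketSizesBytes; // current size of each file group

  SmallFileFirstStrategy(long[] bucketSizesBytes) {
    this.bucketSizesBytes = bucketSizesBytes;
  }

  @Override
  public int assignBucket(String recordKey, String partitionPath) {
    int smallest = 0;
    for (int i = 1; i < bucketSizesBytes.length; i++) {
      if (bucketSizesBytes[i] < bucketSizesBytes[smallest]) {
        smallest = i;
      }
    }
    return smallest;
  }
}
{code}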



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1629) Change partitioner abstraction to implement multiple strategies

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1629:
-
Priority: Blocker  (was: Major)

> Change partitioner abstraction to implement multiple strategies
> ---
>
> Key: HUDI-1629
> URL: https://issues.apache.org/jira/browse/HUDI-1629
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
>
> The existing UpsertPartitioner only considers file sizing to assign
> inserts/updates. We also want to consider data locality and other factors, so
> change the partitioner abstraction to make it easy to implement and plug in
> other strategies.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1631) Sort data when creating a new version of file

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1631:
-
Priority: Blocker  (was: Major)

> Sort data when creating a new version of file
> -
>
> Key: HUDI-1631
> URL: https://issues.apache.org/jira/browse/HUDI-1631
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
>
> Add an option to sort data by specified column(s) when creating a new version
> of a file. Anytime we open a file and write data (whether to apply updates or
> add new records), we want to sort the data in the file by the specified
> column(s).
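
A minimal sketch of the sorting step, assuming a hypothetical Row accessor; the real change would hook this into the path that rewrites a file slice:

{code:java}
import java.util.Comparator;
import java.util.List;

// Sketch only: sort records by the user-specified columns before the new
// file version is written.
final class SortedRewrite {

  interface Row {
    String getString(String column);
  }

  static void sortBeforeWrite(List<Row> records, List<String> sortColumns) {
    Comparator<Row> cmp = null;
    for (String col : sortColumns) {
      Comparator<Row> byCol = Comparator.comparing(r -> r.getString(col));
      cmp = (cmp == null) ? byCol : cmp.thenComparing(byCol);
    }
    if (cmp != null) {
      records.sort(cmp); // applied whether we merge updates or add new records
    }
  }
}
{code}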



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1630) Add partitioner strategy for improving data locality

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1630:
-
Fix Version/s: 0.11.0

> Add partitioner strategy for improving data locality
> 
>
> Key: HUDI-1630
> URL: https://issues.apache.org/jira/browse/HUDI-1630
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
> Fix For: 0.11.0
>
>
> We can use index/metadata information to co-locate records with the same
> value for the specified column(s) in the same fileId.
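
As a toy illustration of the co-location idea (hypothetical names; the real design would consult index/metadata-table stats rather than a plain hash):

{code:java}
// Sketch only: records sharing the same value of the chosen column are
// routed to the same file group, so queries on that column touch fewer files.
final class ColumnValueLocalityAssigner {
  private final int numFileGroups;

  ColumnValueLocalityAssigner(int numFileGroups) {
    this.numFileGroups = numFileGroups;
  }

  int fileGroupFor(Object columnValue) {
    return Math.floorMod(columnValue.hashCode(), numFileGroups);
  }
}
{code}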



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1631) Sort data when creating a new version of file

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1631:
-
Fix Version/s: 0.11.0

> Sort data when creating a new version of file
> -
>
> Key: HUDI-1631
> URL: https://issues.apache.org/jira/browse/HUDI-1631
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Add an option to sort data by specified column(s) when creating a new version
> of a file. Anytime we open a file and write data (whether to apply updates or
> add new records), we want to sort the data in the file by the specified
> column(s).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1630) Add partitioner strategy for improving data locality

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1630:
-
Priority: Blocker  (was: Major)

> Add partitioner strategy for improving data locality
> 
>
> Key: HUDI-1630
> URL: https://issues.apache.org/jira/browse/HUDI-1630
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Blocker
>
> We can use index/metadata information to co-locate records with the same
> value for the specified column(s) in the same fileId.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2003:


Assignee: Alexey Kudinkin

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinay
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>
> Context:
> Submitted a spark job to read 3-4B ORC records and write them in Hudi format.
> The following table lists all the runs I carried out with different
> options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it seems that the compression ratio is off.
>  
>  
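
One way the ratio could be derived automatically, sketched with illustrative names: measure it from a previous commit (output parquet bytes vs. input bytes) and size subsequent writes from that, rather than from a static guess:

{code:java}
// Sketch only: estimate the input->parquet compression ratio from observed
// commit totals, then compute how much input fits in one target file.
final class CompressionRatioEstimator {
  static double estimateRatio(long inputBytesRead, long parquetBytesWritten,
                              double defaultRatio) {
    if (inputBytesRead <= 0 || parquetBytesWritten <= 0) {
      return defaultRatio; // no history yet: use the configured default
    }
    return (double) parquetBytesWritten / inputBytesRead;
  }

  // How many input bytes to pack per file so the output lands near maxFileBytes.
  static long inputBytesPerTargetFile(long maxFileBytes, double ratio) {
    return (long) (maxFileBytes / ratio);
  }
}
{code}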



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2003:
-
Priority: Blocker  (was: Major)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinay
>Priority: Blocker
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>
> Context:
> Submitted a spark job to read 3-4B ORC records and write them in Hudi format.
> The following table lists all the runs I carried out with different
> options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it seems that the compression ratio is off.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2003:
-
Fix Version/s: 0.11.0

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinay
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.11.0
>
>
> Context:
> Submitted a spark job to read 3-4B ORC records and write them in Hudi format.
> The following table lists all the runs I carried out with different
> options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it seems that the compression ratio is off.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-52) Implement Savepoints for Merge On Read table #88

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-52:
---
Labels: core-flow-ds help-requested pull-request-available sev:critical 
starter  (was: core-flow-ds help-requested pull-request-available sev:high 
starter)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: core-flow-ds, help-requested, pull-request-available, 
> sev:critical, starter
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-52) Implement Savepoints for Merge On Read table #88

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-52:
---
Priority: Blocker  (was: Major)

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: core-flow-ds, help-requested, pull-request-available, 
> sev:critical, starter
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1009615068


   
   ## CI report:
   
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * f4e52650a18386d10d58e80f9da02a8d89ae614e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4559: [HUDI-3206][Stacked on 4556] Unify Hive's MOR implementations to avoid duplication

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4559:
URL: https://github.com/apache/hudi/pull/4559#issuecomment-1009567880


   
   ## CI report:
   
   * fc2cdb0f7d9ca50c2414d134c23956f455274ad7 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5079)
 
   * 47970bd3a9cbbf2eb85b0a87f899256487efdffa UNKNOWN
   * f4e52650a18386d10d58e80f9da02a8d89ae614e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-10 Thread GitBox


xushiyan commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1009614560


   @leesf yes we won't have conflicting patches to pick from master. we can 
land this one now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] 06/06: [HUDI-3148] Create pushgateway client based on port (#4497)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 358b43e3b18ea49a09ab83726a55e217c5fdbc06
Author: t0il3ts0ap 
AuthorDate: Tue Jan 11 04:39:47 2022 +0530

[HUDI-3148] Create pushgateway client based on port (#4497)


Co-authored-by: anoop narang 
Co-authored-by: sivabalan narayanan 
---
 .../prometheus/PushGatewayMetricsReporter.java|  3 ++-
 .../hudi/metrics/prometheus/PushGatewayReporter.java  | 19 +--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayMetricsReporter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayMetricsReporter.java
index 17c4d7b..fa4c947 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayMetricsReporter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayMetricsReporter.java
@@ -50,7 +50,8 @@ public class PushGatewayMetricsReporter extends 
MetricsReporter {
 TimeUnit.SECONDS,
 TimeUnit.SECONDS,
 getJobName(),
-serverHost + ":" + serverPort,
+serverHost,
+serverPort,
 deleteShutdown);
   }
 
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayReporter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayReporter.java
index 3b19882..5f82b66 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayReporter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/prometheus/PushGatewayReporter.java
@@ -29,6 +29,8 @@ import com.codahale.metrics.ScheduledReporter;
 import io.prometheus.client.CollectorRegistry;
 import io.prometheus.client.dropwizard.DropwizardExports;
 import io.prometheus.client.exporter.PushGateway;
+import java.net.MalformedURLException;
+import java.net.URL;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
@@ -51,17 +53,30 @@ public class PushGatewayReporter extends ScheduledReporter {
 TimeUnit rateUnit,
 TimeUnit durationUnit,
 String jobName,
-String address,
+String serverHost,
+int serverPort,
 boolean deleteShutdown) {
 super(registry, "hudi-push-gateway-reporter", filter, rateUnit, 
durationUnit);
 this.jobName = jobName;
 this.deleteShutdown = deleteShutdown;
 collectorRegistry = new CollectorRegistry();
 metricExports = new DropwizardExports(registry);
-pushGateway = new PushGateway(address);
+pushGateway = createPushGatewayClient(serverHost, serverPort);
 metricExports.register(collectorRegistry);
   }
 
+  private PushGateway createPushGatewayClient(String serverHost, int 
serverPort) {
+if (serverPort == 443) {
+  try {
+return new PushGateway(new URL("https://; + serverHost + ":" + 
serverPort));
+  } catch (MalformedURLException e) {
+e.printStackTrace();
+throw new IllegalArgumentException("Malformed pushgateway host: " + 
serverHost);
+  }
+}
+return new PushGateway(serverHost + ":" + serverPort);
+  }
+
   @Override
   public void report(SortedMap<String, Gauge> gauges,
  SortedMap<String, Counter> counters,


[hudi] 01/06: [HUDI-3157] Remove aws jars from hudi bundles (#4542)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b7658bcc84c18225d36ab733b6d0b0978fe68521
Author: RexAn 
AuthorDate: Sun Jan 9 18:23:46 2022 +0800

[HUDI-3157] Remove aws jars from hudi bundles (#4542)

Co-authored-by: Hui An 
---
 packaging/hudi-spark-bundle/pom.xml | 4 
 packaging/hudi-utilities-bundle/pom.xml | 4 
 2 files changed, 8 deletions(-)

diff --git a/packaging/hudi-spark-bundle/pom.xml 
b/packaging/hudi-spark-bundle/pom.xml
index 3544e31..9315943 100644
--- a/packaging/hudi-spark-bundle/pom.xml
+++ b/packaging/hudi-spark-bundle/pom.xml
@@ -103,10 +103,6 @@
   <include>com.yammer.metrics:metrics-core</include>
   <include>com.google.guava:guava</include>
 
-  <include>com.amazonaws:dynamodb-lock-client</include>
-  <include>com.amazonaws:aws-java-sdk-dynamodb</include>
-  <include>com.amazonaws:aws-java-sdk-core</include>
-
   <include>org.apache.spark:spark-avro_${scala.binary.version}</include>
   <include>org.apache.hive:hive-common</include>
   <include>org.apache.hive:hive-service</include>
diff --git a/packaging/hudi-utilities-bundle/pom.xml 
b/packaging/hudi-utilities-bundle/pom.xml
index a3da0a8..4365860 100644
--- a/packaging/hudi-utilities-bundle/pom.xml
+++ b/packaging/hudi-utilities-bundle/pom.xml
@@ -113,10 +113,6 @@
   <include>org.antlr:stringtemplate</include>
   <include>org.apache.parquet:parquet-avro</include>
 
-  <include>com.amazonaws:dynamodb-lock-client</include>
-  <include>com.amazonaws:aws-java-sdk-dynamodb</include>
-  <include>com.amazonaws:aws-java-sdk-core</include>
-
   <include>com.github.davidmoten:guava-mini</include>
   <include>com.github.davidmoten:hilbert-curve</include>
   <include>com.twitter:bijection-avro_${scala.binary.version}</include>


[hudi] 04/06: [HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi (#4544)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c39e342b154cd88d1905122d30b8887f8028d204
Author: Y Ethan Guo 
AuthorDate: Mon Jan 10 12:31:25 2022 -0800

[HUDI-2735] Allow empty commits in Kafka Connect Sink for Hudi (#4544)
---
 .../transaction/ConnectTransactionCoordinator.java |  4 +--
 .../connect/TestConnectTransactionCoordinator.java | 42 --
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/transaction/ConnectTransactionCoordinator.java
 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/transaction/ConnectTransactionCoordinator.java
index 14fd880..1157b21 100644
--- 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/transaction/ConnectTransactionCoordinator.java
+++ 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/transaction/ConnectTransactionCoordinator.java
@@ -294,7 +294,7 @@ public class ConnectTransactionCoordinator implements 
TransactionCoordinator, Ru
 long totalRecords = (long) 
allWriteStatuses.stream().mapToDouble(WriteStatus::getTotalRecords).sum();
 boolean hasErrors = totalErrorRecords > 0;
 
-if ((!hasErrors || configs.allowCommitOnErrors()) && 
!allWriteStatuses.isEmpty()) {
+if (!hasErrors || configs.allowCommitOnErrors()) {
   boolean success = transactionServices.endCommit(currentCommitTime,
   allWriteStatuses,
   transformKafkaOffsets(currentConsumedKafkaOffsets));
@@ -319,8 +319,6 @@ public class ConnectTransactionCoordinator implements 
TransactionCoordinator, Ru
   ws.getErrors().forEach((key, value) -> LOG.trace("Error for 
key:" + key + " is " + value));
 }
   });
-} else {
-  LOG.warn("Empty write statuses were received from all Participants");
 }
 
 // Submit the next start commit, that will rollback the current commit.
diff --git 
a/hudi-kafka-connect/src/test/java/org/apache/hudi/connect/TestConnectTransactionCoordinator.java
 
b/hudi-kafka-connect/src/test/java/org/apache/hudi/connect/TestConnectTransactionCoordinator.java
index f003fe9..d939351 100644
--- 
a/hudi-kafka-connect/src/test/java/org/apache/hudi/connect/TestConnectTransactionCoordinator.java
+++ 
b/hudi-kafka-connect/src/test/java/org/apache/hudi/connect/TestConnectTransactionCoordinator.java
@@ -178,27 +178,39 @@ public class TestConnectTransactionCoordinator {
   List<ControlMessage> controlEvents = new ArrayList<>();
   switch (testScenario) {
 case ALL_CONNECT_TASKS_SUCCESS:
-  composeControlEvent(message.getCommitTime(), false, 
kafkaOffsets, controlEvents);
+  composeControlEvent(
+  message.getCommitTime(), false, false, kafkaOffsets, 
controlEvents);
+  numPartitionsThatReportWriteStatus = TOTAL_KAFKA_PARTITIONS;
+  // This commit round should succeed, and the kafka offsets 
getting committed
+  kafkaOffsetsCommitted.putAll(kafkaOffsets);
+  expectedMsgType = ControlMessage.EventType.ACK_COMMIT;
+  break;
+case ALL_CONNECT_TASKS_WITH_EMPTY_WRITE_STATUS:
+  composeControlEvent(
+  message.getCommitTime(), false, true, kafkaOffsets, 
controlEvents);
   numPartitionsThatReportWriteStatus = TOTAL_KAFKA_PARTITIONS;
   // This commit round should succeed, and the kafka offsets 
getting committed
   kafkaOffsetsCommitted.putAll(kafkaOffsets);
   expectedMsgType = ControlMessage.EventType.ACK_COMMIT;
   break;
 case SUBSET_WRITE_STATUS_FAILED_BUT_IGNORED:
-  composeControlEvent(message.getCommitTime(), true, kafkaOffsets, 
controlEvents);
+  composeControlEvent(
+  message.getCommitTime(), true, false, kafkaOffsets, 
controlEvents);
   numPartitionsThatReportWriteStatus = TOTAL_KAFKA_PARTITIONS;
   // Despite error records, this commit round should succeed, and 
the kafka offsets getting committed
   kafkaOffsetsCommitted.putAll(kafkaOffsets);
   expectedMsgType = ControlMessage.EventType.ACK_COMMIT;
   break;
 case SUBSET_WRITE_STATUS_FAILED:
-  composeControlEvent(message.getCommitTime(), true, kafkaOffsets, 
controlEvents);
+  composeControlEvent(
+  message.getCommitTime(), true, false, kafkaOffsets, 
controlEvents);
   numPartitionsThatReportWriteStatus = TOTAL_KAFKA_PARTITIONS;
   // This commit round should fail, and a new commit round should 
start without kafka offsets getting committed
   expectedMsgType = ControlMessage.EventType.START_COMMIT;
   break;
 case 

[hudi] 05/06: [MINOR] Fix port number in setupKafka.sh (#4546)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 44a85c7ab04fe32aea7ed492d2d9fa3e546c0d94
Author: Y Ethan Guo 
AuthorDate: Mon Jan 10 13:07:52 2022 -0800

[MINOR] Fix port number in setupKafka.sh (#4546)
---
 hudi-kafka-connect/demo/setupKafka.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hudi-kafka-connect/demo/setupKafka.sh 
b/hudi-kafka-connect/demo/setupKafka.sh
index c75b4a9..5c618b2 100755
--- a/hudi-kafka-connect/demo/setupKafka.sh
+++ b/hudi-kafka-connect/demo/setupKafka.sh
@@ -130,7 +130,7 @@ fi
 # Setup the schema registry
 export SCHEMA=$(sed 's|/\*|\n&|g;s|*/|&\n|g' ${schemaFile} | sed 
'/\/\*/,/*\//d' | jq tostring)
 curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data 
"{\"schema\": $SCHEMA}" 
http://localhost:8082/subjects/${kafkaTopicName}/versions
-curl -X GET http://localhost:8081/subjects/${kafkaTopicName}/versions/latest
+curl -X GET http://localhost:8082/subjects/${kafkaTopicName}/versions/latest
 
 # Generate kafka messages from raw records
 # Each records with unique keys and generate equal messages across each hudi 
partition


[hudi] 03/06: Removing rollbacks instants from timeline for restore operation (#4518)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit b6fde073961b8ddb5a2175f8ed0d6ea59e5798a4
Author: Sivabalan Narayanan 
AuthorDate: Sun Jan 9 21:14:28 2022 -0500

Removing rollbacks instants from timeline for restore operation (#4518)
---
 .../hudi/table/action/restore/BaseRestoreActionExecutor.java   | 10 ++
 .../functional/TestHoodieClientOnCopyOnWriteStorage.java   |  2 ++
 2 files changed, 12 insertions(+)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/restore/BaseRestoreActionExecutor.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/restore/BaseRestoreActionExecutor.java
index 9371340..58247bb 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/restore/BaseRestoreActionExecutor.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/restore/BaseRestoreActionExecutor.java
@@ -99,6 +99,16 @@ public abstract class BaseRestoreActionExecutor
+List<HoodieInstant> instantsToRollback = 
table.getActiveTimeline().getRollbackTimeline()
+.getReverseOrderedInstants()
+.filter(instant -> 
HoodieActiveTimeline.GREATER_THAN.test(instant.getTimestamp(), 
restoreInstantTime))
+.collect(Collectors.toList());
+instantsToRollback.forEach(entry -> {
+  table.getActiveTimeline().deletePending(new 
HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.ROLLBACK_ACTION, 
entry.getTimestamp()));
+  table.getActiveTimeline().deletePending(new 
HoodieInstant(HoodieInstant.State.REQUESTED, HoodieTimeline.ROLLBACK_ACTION, 
entry.getTimestamp()));
+});
 LOG.info("Commits " + instantsRolledBack + " rollback is complete. 
Restored table to " + restoreInstantTime);
 return restoreMetadata;
   }
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java
index aa3ead4..2ec4f7d 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieClientOnCopyOnWriteStorage.java
@@ -577,6 +577,8 @@ public class TestHoodieClientOnCopyOnWriteStorage extends 
HoodieClientTestBase {
 client = getHoodieWriteClient(newConfig);
 client.restoreToInstant("004");
 
+
assertFalse(metaClient.reloadActiveTimeline().getRollbackTimeline().lastInstant().isPresent());
+
 // Check the entire dataset has all records still
 String[] fullPartitionPaths = new 
String[dataGen.getPartitionPaths().length];
 for (int i = 0; i < fullPartitionPaths.length; i++) {


[hudi] branch release-0.10.1-rc1 updated (eac1a02 -> 358b43e)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from eac1a02  [HUDI-3195] Fix spark 3 pom (#4555)
 new b7658bc  [HUDI-3157] Remove aws jars from hudi bundles (#4542)
 new 2b98d90  [HUDI-3112] Fix KafkaConnect cannot sync to Hive Problem 
(#4458)
 new b6fde07  Removing rollbacks instants from timeline for restore 
operation (#4518)
 new c39e342  [HUDI-2735] Allow empty commits in Kafka Connect Sink for 
Hudi (#4544)
 new 44a85c7  [MINOR] Fix port number in setupKafka.sh (#4546)
 new 358b43e  [HUDI-3148] Create pushgateway client based on port (#4497)

The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../prometheus/PushGatewayMetricsReporter.java |  3 +-
 .../metrics/prometheus/PushGatewayReporter.java| 19 --
 .../action/restore/BaseRestoreActionExecutor.java  | 10 ++
 .../TestHoodieClientOnCopyOnWriteStorage.java  |  2 ++
 hudi-kafka-connect/demo/setupKafka.sh  |  2 +-
 .../transaction/ConnectTransactionCoordinator.java |  4 +--
 .../hudi/connect/utils/KafkaConnectUtils.java  | 31 
 .../hudi/connect/writers/KafkaConnectConfigs.java  | 16 +
 .../writers/KafkaConnectTransactionServices.java   | 25 +++--
 .../connect/TestConnectTransactionCoordinator.java | 42 --
 packaging/hudi-spark-bundle/pom.xml|  4 ---
 packaging/hudi-utilities-bundle/pom.xml|  4 ---
 12 files changed, 126 insertions(+), 36 deletions(-)


[hudi] 02/06: [HUDI-3112] Fix KafkaConnect cannot sync to Hive Problem (#4458)

2022-01-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.10.1-rc1
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2b98d909886c34a8a64c520f459cdf9a1a3a9c6c
Author: Thinking Chen 
AuthorDate: Mon Jan 10 07:31:57 2022 +0800

[HUDI-3112] Fix KafkaConnect cannot sync to Hive Problem (#4458)
---
 .../hudi/connect/utils/KafkaConnectUtils.java  | 31 ++
 .../hudi/connect/writers/KafkaConnectConfigs.java  | 16 +++
 .../writers/KafkaConnectTransactionServices.java   | 25 ++---
 3 files changed, 62 insertions(+), 10 deletions(-)

diff --git 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
index cc37de2..6a38430 100644
--- 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
+++ 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
@@ -32,6 +32,8 @@ import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.connect.ControlMessage;
 import org.apache.hudi.connect.writers.KafkaConnectConfigs;
 import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.hive.HiveSyncConfig;
+import org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor;
 import org.apache.hudi.keygen.BaseKeyGenerator;
 import org.apache.hudi.keygen.CustomAvroKeyGenerator;
 import org.apache.hudi.keygen.CustomKeyGenerator;
@@ -57,6 +59,7 @@ import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.Arrays;
 import java.util.ArrayList;
+import java.util.Collections;
 import java.util.List;
 import java.util.Map;
 import java.util.Objects;
@@ -266,4 +269,32 @@ public class KafkaConnectUtils {
 ControlMessage.ConnectWriteStatus connectWriteStatus = 
participantInfo.getWriteStatus();
 return 
SerializationUtils.deserialize(connectWriteStatus.getSerializedWriteStatus().toByteArray());
   }
+
+  /**
+   * Build Hive Sync Config
+   * Note: This method is a temporary solution.
+   * Future solutions can be referred to: 
https://issues.apache.org/jira/browse/HUDI-3199
+   */
+  public static HiveSyncConfig buildSyncConfig(TypedProperties props, String 
tableBasePath) {
+HiveSyncConfig hiveSyncConfig = new HiveSyncConfig();
+hiveSyncConfig.basePath = tableBasePath;
+hiveSyncConfig.usePreApacheInputFormat = 
props.getBoolean(KafkaConnectConfigs.HIVE_USE_PRE_APACHE_INPUT_FORMAT, false);
+hiveSyncConfig.databaseName = 
props.getString(KafkaConnectConfigs.HIVE_DATABASE, "default");
+hiveSyncConfig.tableName = props.getString(KafkaConnectConfigs.HIVE_TABLE, 
"");
+hiveSyncConfig.hiveUser = props.getString(KafkaConnectConfigs.HIVE_USER, 
"");
+hiveSyncConfig.hivePass = props.getString(KafkaConnectConfigs.HIVE_PASS, 
"");
+hiveSyncConfig.jdbcUrl = props.getString(KafkaConnectConfigs.HIVE_URL, "");
+hiveSyncConfig.partitionFields = 
props.getStringList(KafkaConnectConfigs.HIVE_PARTITION_FIELDS, ",", 
Collections.emptyList());
+hiveSyncConfig.partitionValueExtractorClass =
+
props.getString(KafkaConnectConfigs.HIVE_PARTITION_EXTRACTOR_CLASS, 
SlashEncodedDayPartitionValueExtractor.class.getName());
+hiveSyncConfig.useJdbc = 
props.getBoolean(KafkaConnectConfigs.HIVE_USE_JDBC, true);
+if (props.containsKey(KafkaConnectConfigs.HIVE_SYNC_MODE)) {
+  hiveSyncConfig.syncMode = 
props.getString(KafkaConnectConfigs.HIVE_SYNC_MODE);
+}
+hiveSyncConfig.autoCreateDatabase = 
props.getBoolean(KafkaConnectConfigs.HIVE_AUTO_CREATE_DATABASE, true);
+hiveSyncConfig.ignoreExceptions = 
props.getBoolean(KafkaConnectConfigs.HIVE_IGNORE_EXCEPTIONS, false);
+hiveSyncConfig.skipROSuffix = 
props.getBoolean(KafkaConnectConfigs.HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE,
 false);
+hiveSyncConfig.supportTimestamp = 
props.getBoolean(KafkaConnectConfigs.HIVE_SUPPORT_TIMESTAMP_TYPE, false);
+return hiveSyncConfig;
+  }
 }
diff --git 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectConfigs.java
 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectConfigs.java
index 1200779..e4543c6 100644
--- 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectConfigs.java
+++ 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectConfigs.java
@@ -164,6 +164,22 @@ public class KafkaConnectConfigs extends HoodieConfig {
 return getString(HADOOP_HOME);
   }
 
+  public static final String HIVE_USE_PRE_APACHE_INPUT_FORMAT = 
"hoodie.datasource.hive_sync.use_pre_apache_input_format";
+  public static final String HIVE_DATABASE = 
"hoodie.datasource.hive_sync.database";
+  public static final String HIVE_TABLE = "hoodie.datasource.hive_sync.table";
+  public static final String HIVE_USER = 

[jira] [Updated] (HUDI-2450) Make Flink MOR table & Kafka Connect writing streaming friendly

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2450:
-
Summary: Make Flink MOR table & Kafka Connect writing streaming friendly  
(was: Make Flink MOR table writing streaming friendly)

> Make Flink MOR table & Kafka Connect writing streaming friendly
> ---
>
> Key: HUDI-2450
> URL: https://issues.apache.org/jira/browse/HUDI-2450
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2450) Make Flink MOR table writing streaming friendly

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2450:
-
Fix Version/s: 0.11.0

> Make Flink MOR table writing streaming friendly
> ---
>
> Key: HUDI-2450
> URL: https://issues.apache.org/jira/browse/HUDI-2450
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2450) Make Flink MOR table writing streaming friendly

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2450:
-
Priority: Blocker  (was: Major)

> Make Flink MOR table writing streaming friendly
> ---
>
> Key: HUDI-2450
> URL: https://issues.apache.org/jira/browse/HUDI-2450
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4557: [WIP] Allow pass rollbackUsingMarkers to Hudi CLI rollback command

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4557:
URL: https://github.com/apache/hudi/pull/4557#issuecomment-1009564348


   
   ## CI report:
   
   * fd69378e681c656bbef48e5112d2c7152d1646b0 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5078)
 
   * 3b59011da87037f698e5a08baf0a9af38a87a339 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5085)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4557: [WIP] Allow pass rollbackUsingMarkers to Hudi CLI rollback command

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4557:
URL: https://github.com/apache/hudi/pull/4557#issuecomment-1009607766


   
   ## CI report:
   
   * 3b59011da87037f698e5a08baf0a9af38a87a339 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5085)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1296:


Assignee: Alexey Kudinkin  (was: Manoj Govindassamy)

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1009605269


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * 08c493bdeb441e150177fdaf3cbc5e4bcfabc42a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4910)
 
   * e47f03caabfd7af8588cce120989e0370111ae62 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4515: [HUDI-3158] Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4515:
URL: https://github.com/apache/hudi/pull/4515#issuecomment-1005896164


   
   ## CI report:
   
   * 88c372fa74626363b1611825aa2d62e7ff292677 UNKNOWN
   * 1a657de1aa76f54015a405357cfd65ca1244bb19 UNKNOWN
   * 08c493bdeb441e150177fdaf3cbc5e4bcfabc42a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4910)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1885) Support Delete/Update Non-Pk Table

2022-01-10 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1885:
-
Priority: Blocker  (was: Critical)

> Support Delete/Update Non-Pk Table
> --
>
> Key: HUDI-1885
> URL: https://issues.apache.org/jira/browse/HUDI-1885
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
>
> Allow to delete/update a non-pk table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-10 Thread Manoj Govindassamy (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472453#comment-17472453
 ] 

Manoj Govindassamy commented on HUDI-1296:
--

Related: https://issues.apache.org/jira/browse/HUDI-2644

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
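
For context, the heart of range-based pruning is a per-file min/max containment
check: a file whose recorded [min, max] range for a column cannot contain the
query literal can be skipped without being read. A minimal sketch of that check
(class and method names are invented here; this is not Hudi's implementation):

{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class RangePruningSketch {

  /** Per-file column statistics, as the metadata table might serve them. */
  record FileColumnRange(String fileName, long min, long max) {}

  /** Keep only the files whose [min, max] range could contain the literal. */
  static List<String> candidateFilesForEquality(List<FileColumnRange> stats,
                                                long literal) {
    return stats.stream()
        .filter(s -> literal >= s.min && literal <= s.max)
        .map(FileColumnRange::fileName)
        .collect(Collectors.toList());
  }
}
{code}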




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3177) Support CREATE INDEX statement

2022-01-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3177:
--
Sprint: Hudi-Sprint-Jan-10

> Support CREATE INDEX statement
> --
>
> Key: HUDI-3177
> URL: https://issues.apache.org/jira/browse/HUDI-3177
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Users should be able to trigger index creation using a CREATE INDEX statement
> for one or more partitions.
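
As a rough illustration only (the final SQL grammar is not specified in this
issue, so the statement text below is an assumption, not the adopted syntax),
such a statement could be issued from a Spark session like this:

{code:java}
import org.apache.spark.sql.SparkSession;

public class CreateIndexSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("create-index-sketch")
        .master("local[1]")
        .getOrCreate();

    // Hypothetical syntax: index name, target table, indexed column.
    // Requires the CREATE INDEX support proposed by this issue; stock
    // Spark SQL would reject this statement.
    spark.sql("CREATE INDEX idx_price ON h0 (price)");

    spark.stop();
  }
}
{code}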



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-10 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-1296:
-
Sprint: Hudi-Sprint-Jan-10

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2644) Integrate existing curves with stats from the metadata table

2022-01-10 Thread Manoj Govindassamy (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472452#comment-17472452
 ] 

Manoj Govindassamy commented on HUDI-2644:
--

This looks like a DUP of https://issues.apache.org/jira/browse/HUDI-1296

> Integrate existing curves with stats from the metadata table
> 
>
> Key: HUDI-2644
> URL: https://issues.apache.org/jira/browse/HUDI-2644
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> We probably need to move HoodieTable#updateStatistics under 
> HoodieTableMetadata? 
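
A minimal sketch of the suggested move, assuming a simplified signature (the
actual HoodieTable#updateStatistics signature is not shown in this issue):

{code:java}
import java.util.List;

// Hypothetical, trimmed-down interface: the statistics-update hook lives
// with the metadata table instead of on HoodieTable.
interface HoodieTableMetadata {
  /** Record column/range statistics for the files touched at this instant. */
  void updateStatistics(String instantTime, List<String> touchedFilePaths);
}
{code}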



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-10 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-1296:
-
Story Points: 4

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3176) Add index commit metadata

2022-01-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3176:
--
Sprint: Hudi-Sprint-Jan-10

> Add index commit metadata
> -
>
> Key: HUDI-3176
> URL: https://issues.apache.org/jira/browse/HUDI-3176
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> We need index request metadata at index-planning time and index commit 
> metadata at index-execution time; together they should record which metadata 
> partitions are planned or pending indexing and therefore not yet available. 
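
A rough sketch of the shape this metadata could take (the class and field names
below are assumptions, not the committed schema, which would likely be Avro):

{code:java}
import java.util.List;

class HoodieIndexCommitMetadata {
  String indexInstantTime;                  // instant of the INDEX action
  List<String> plannedMetadataPartitions;   // filled in during index planning
  List<String> completedMetadataPartitions; // filled in during index execution
  // Partitions that are planned but not yet completed are exactly the ones
  // readers must treat as "not available".
}
{code}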



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-10 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy reassigned HUDI-1296:


Assignee: Manoj Govindassamy

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3174) Implement metadata filesystem view changes to support INDEX action type

2022-01-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3174:
--
Sprint: Hudi-Sprint-Jan-10

> Implement metadata filesystem view changes to support INDEX action type
> ---
>
> Key: HUDI-3174
> URL: https://issues.apache.org/jira/browse/HUDI-3174
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Handle a pending index action while listing partitions; see the sketch below.
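
Illustratively, the listing path would subtract pending-index partitions from
what it exposes; a minimal sketch with invented method names:

{code:java}
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class PendingIndexFilterSketch {
  /** Hide metadata partitions whose index build is still in flight. */
  static List<String> listAvailablePartitions(List<String> allPartitions,
                                              Set<String> pendingIndexPartitions) {
    return allPartitions.stream()
        .filter(p -> !pendingIndexPartitions.contains(p))
        .collect(Collectors.toList());
  }
}
{code}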



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3175) Support INDEX action in write client

2022-01-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3175:
--
Sprint: Hudi-Sprint-Jan-10

> Support INDEX action in write client
> 
>
> Key: HUDI-3175
> URL: https://issues.apache.org/jira/browse/HUDI-3175
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Add a new WriteOperationType and handle conflicts with a concurrent writer or 
> any other async table service. Implement the protocol from HUDI-2488.
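
A minimal sketch, assuming the action is modeled like the existing operation
types (the constant name is an assumption):

{code:java}
// Hypothetical, trimmed-down version of the enum; the real
// org.apache.hudi.common.model.WriteOperationType has many more constants.
enum WriteOperationType {
  INSERT,
  UPSERT,
  DELETE,
  COMPACT,
  INDEX  // new: builds metadata-table index partitions as a table service
}
{code}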



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2689) Support for snapshot query on COW table

2022-01-10 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2689:
--
Status: Resolved  (was: Patch Available)

> Support for snapshot query on COW table
> ---
>
> Key: HUDI-2689
> URL: https://issues.apache.org/jira/browse/HUDI-2689
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`

2022-01-10 Thread GitBox


hudi-bot commented on pull request #4560:
URL: https://github.com/apache/hudi/pull/4560#issuecomment-1009598426


   
   ## CI report:
   
   * 09c7cc0d987f4c5efa45006631857dc1ee2b9a32 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5087)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4560: [WIP] Fixing generic usages for `HoodieRecordPayload`

2022-01-10 Thread GitBox


hudi-bot removed a comment on pull request #4560:
URL: https://github.com/apache/hudi/pull/4560#issuecomment-1009573756


   
   ## CI report:
   
   * 09c7cc0d987f4c5efa45006631857dc1ee2b9a32 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5087)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3207) Hudi Trino connector PR review

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3207:
-
Sprint: Hudi-Sprint-Jan-10

> Hudi Trino connector PR review
> --
>
> Key: HUDI-3207
> URL: https://issues.apache.org/jira/browse/HUDI-3207
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3207) Hudi Trino connector PR review

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3207:
-
Reviewers: Vinoth Chandar

> Hudi Trino connector PR review
> --
>
> Key: HUDI-3207
> URL: https://issues.apache.org/jira/browse/HUDI-3207
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields

2022-01-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2968:
-
Sprint: Hudi-Sprint-Jan-10

> Support Delete/Update using non-pk fields
> -
>
> Key: HUDI-2968
> URL: https://issues.apache.org/jira/browse/HUDI-2968
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Allow deleting and updating rows using non-pk fields:
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi 
> options (primaryKey = 'id');
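> -- both statements below filter on 'name', which is not the primary key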
> update h0 set price = 10 where name = 'foo'; 
> delete from h0 where name = 'foo';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

