[GitHub] [hudi] hudi-bot commented on pull request #6311: [HUDI-4548] Unpack the column max/min to string instead of Utf8 for M…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6311:
URL: https://github.com/apache/hudi/pull/6311#issuecomment-1206109906

   
   ## CI report:
   
   * 17848d0f924115607c4144b3fa0a218333e89c99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10602)
 
   * 04f067fce6df4225c497caeecd63dba7d069ba75 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10606)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206109860

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   * 666088efaacc584a5f36db4df2f44f358e1ba53c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10599)
 
   * 14aa4355ee414a6cb4814950216fe5ea93ccba16 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10605)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6311: [HUDI-4548] Unpack the column max/min to string instead of Utf8 for M…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6311:
URL: https://github.com/apache/hudi/pull/6311#issuecomment-1206106505

   
   ## CI report:
   
   * 17848d0f924115607c4144b3fa0a218333e89c99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10602)
 
   * 04f067fce6df4225c497caeecd63dba7d069ba75 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206106447

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   * 666088efaacc584a5f36db4df2f44f358e1ba53c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10599)
 
   * 14aa4355ee414a6cb4814950216fe5ea93ccba16 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6310: [HUDI-4474] Fix inferring props for meta sync

2022-08-04 Thread GitBox


hudi-bot commented on PR #6310:
URL: https://github.com/apache/hudi/pull/6310#issuecomment-1206103377

   
   ## CI report:
   
   * 366dc59d094ffcdd05ba7cdf905b85cb684a9fa7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10601)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6309: [HUDI-4547] Fix SortOperatorGen sort indices

2022-08-04 Thread GitBox


hudi-bot commented on PR #6309:
URL: https://github.com/apache/hudi/pull/6309#issuecomment-1206103348

   
   ## CI report:
   
   * f6df4432d24639619566565e3fac86cbd855ce9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10600)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-08-04 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206102844

   
   ## CI report:
   
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597)
 
   
   





[jira] [Assigned] (HUDI-4551) The default value of READ_TASKS, WRITE_TASKS, CLUSTERING_TASKS is the parallelism of the execution environment

2022-08-04 Thread Nicholas Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Jiang reassigned HUDI-4551:


Assignee: Nicholas Jiang

> The default value of READ_TASKS, WRITE_TASKS, CLUSTERING_TASKS is the 
> parallelism of the execution environment
> --
>
> Key: HUDI-4551
> URL: https://issues.apache.org/jira/browse/HUDI-4551
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Minor
>
> The default value of READ_TASKS, WRITE_TASKS, and CLUSTERING_TASKS is 4; it 
> could instead default to the parallelism of the execution environment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4550) Investigate why rollback is triggered for completed instant

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4550:
--
Fix Version/s: 0.13.0

> Investigate why rollback is triggered for completed instant
> ---
>
> Key: HUDI-4550
> URL: https://issues.apache.org/jira/browse/HUDI-4550
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Priority: Major
> Fix For: 0.13.0
>
>
> See issue [https://github.com/apache/hudi/issues/6224]
> Ideally, rollback should not be triggered for a completed instant. But if it 
> does, it should be safe to fall back to listing-based rollback.
>  





[jira] [Created] (HUDI-4551) The default value of READ_TASKS, WRITE_TASKS, CLUSTERING_TASKS is the parallelism of the execution environment

2022-08-04 Thread Nicholas Jiang (Jira)
Nicholas Jiang created HUDI-4551:


 Summary: The default value of READ_TASKS, WRITE_TASKS, 
CLUSTERING_TASKS is the parallelism of the execution environment
 Key: HUDI-4551
 URL: https://issues.apache.org/jira/browse/HUDI-4551
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: Nicholas Jiang


The default value of READ_TASKS, WRITE_TASKS, and CLUSTERING_TASKS is 4; it 
could instead default to the parallelism of the execution environment.





[hudi] branch master updated (e03cd0a198 -> fcdd4cf06c)

2022-08-04 Thread garyli
This is an automated email from the ASF dual-hosted git repository.

garyli pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e03cd0a198 [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (#6306)
 add fcdd4cf06c [HUDI-4544] support retain hour cleaning policy for flink (#6300)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/configuration/FlinkOptions.java | 8 ++++++++
 .../main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java   | 7 +++++++
 .../src/main/java/org/apache/hudi/util/StreamerUtil.java          | 1 +
 3 files changed, 16 insertions(+)



[GitHub] [hudi] garyli1019 merged pull request #6300: [HUDI-4544] support retain hour cleaning policy for flink

2022-08-04 Thread GitBox


garyli1019 merged PR #6300:
URL: https://github.com/apache/hudi/pull/6300





[GitHub] [hudi] codope commented on issue #6224: [SUPPORT] Caused by: java.lang.IllegalArgumentException: Cannot use marker based rollback strategy on completed instant

2022-08-04 Thread GitBox


codope commented on issue #6224:
URL: https://github.com/apache/hudi/issues/6224#issuecomment-1206097133

   @jtchen-study Ideally, rollback is triggered only for failed writes. As such, 
falling back to listing-based rollback should be safe, but we need to understand 
how rollback got triggered for a completed instant. Can you describe the 
sequence of events and your setup? Steps to reproduce would be very helpful. Is 
this a single-writer or multi-writer scenario?
   I've created HUDI-4550 to track the investigation.





[jira] [Created] (HUDI-4550) Investigate why rollback is triggered for completed instant

2022-08-04 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-4550:
-

 Summary: Investigate why rollback is triggered for completed 
instant
 Key: HUDI-4550
 URL: https://issues.apache.org/jira/browse/HUDI-4550
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit


See issue [https://github.com/apache/hudi/issues/6224]

Ideally, rollback should not be triggered for a completed instant. But if it 
does, it should be safe to fall back to listing-based rollback.

 





[jira] [Closed] (HUDI-4536) ClusteringOperator causes the NullPointerException when writing with BulkInsertWriterHelper in clustering

2022-08-04 Thread Nicholas Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Jiang closed HUDI-4536.

 Reviewers: Danny Chen
Resolution: Fixed

> ClusteringOperator causes the NullPointerException when writing with 
> BulkInsertWriterHelper in clustering
> -
>
> Key: HUDI-4536
> URL: https://issues.apache.org/jira/browse/HUDI-4536
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> ClusteringOperator causes a NullPointerException when writing with 
> BulkInsertWriterHelper for clustering, because the BulkInsertWriterHelper 
> isn't set to null after close.





[hudi] branch master updated: [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (#6306)

2022-08-04 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e03cd0a198 [HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (#6306)
e03cd0a198 is described below

commit e03cd0a198f63df7fb7ba71d1c9a0b01ae33f021
Author: Danny Chan 
AuthorDate: Fri Aug 5 14:16:53 2022 +0800

[HUDI-4545] Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload (#6306)
---
 .../model/OverwriteNonDefaultsWithLatestAvroPayload.java  |  8 ++--
 .../model/TestOverwriteNonDefaultsWithLatestAvroPayload.java  | 11 +--
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteNonDefaultsWithLatestAvroPayload.java b/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteNonDefaultsWithLatestAvroPayload.java
index 93ac96cb42..6ce99aae21 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteNonDefaultsWithLatestAvroPayload.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteNonDefaultsWithLatestAvroPayload.java
@@ -20,6 +20,7 @@ package org.apache.hudi.common.model;
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.GenericRecordBuilder;
 import org.apache.avro.generic.IndexedRecord;
 
 import org.apache.hudi.common.util.Option;
@@ -60,16 +61,19 @@ public class OverwriteNonDefaultsWithLatestAvroPayload extends OverwriteWithLate
 if (isDeleteRecord(insertRecord)) {
   return Option.empty();
 } else {
+  final GenericRecordBuilder builder = new GenericRecordBuilder(schema);
   List fields = schema.getFields();
   fields.forEach(field -> {
 Object value = insertRecord.get(field.name());
 value = field.schema().getType().equals(Schema.Type.STRING) && value != null ? value.toString() : value;
 Object defaultValue = field.defaultVal();
 if (!overwriteField(value, defaultValue)) {
-  currentRecord.put(field.name(), value);
+  builder.set(field, value);
+} else {
+  builder.set(field, currentRecord.get(field.pos()));
 }
   });
-  return Option.of(currentRecord);
+  return Option.of(builder.build());
 }
   }
 }
diff --git a/hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteNonDefaultsWithLatestAvroPayload.java b/hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteNonDefaultsWithLatestAvroPayload.java
index c6eee05b87..9e3405b304 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteNonDefaultsWithLatestAvroPayload.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/model/TestOverwriteNonDefaultsWithLatestAvroPayload.java
@@ -22,6 +22,7 @@ import org.apache.avro.JsonProperties;
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericData;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 
@@ -31,6 +32,7 @@ import java.util.Collections;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertNotSame;
 
 /**
  * Unit tests {@link TestOverwriteNonDefaultsWithLatestAvroPayload}.
@@ -85,8 +87,13 @@ public class TestOverwriteNonDefaultsWithLatestAvroPayload {
 assertEquals(record1, payload1.getInsertValue(schema).get());
 assertEquals(record2, payload2.getInsertValue(schema).get());
 
-assertEquals(payload1.combineAndGetUpdateValue(record2, schema).get(), record1);
-assertEquals(payload2.combineAndGetUpdateValue(record1, schema).get(), record3);
+IndexedRecord combinedVal1 = payload1.combineAndGetUpdateValue(record2, schema).get();
+assertEquals(combinedVal1, record1);
+assertNotSame(combinedVal1, record1);
+
+IndexedRecord combinedVal2 = payload2.combineAndGetUpdateValue(record1, schema).get();
+assertEquals(combinedVal2, record3);
+assertNotSame(combinedVal2, record3);
   }
 
   @Test



[jira] [Commented] (HUDI-4545) Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload

2022-08-04 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575596#comment-17575596
 ] 

Danny Chen commented on HUDI-4545:
--

Fixed via master branch: e03cd0a198f63df7fb7ba71d1c9a0b01ae33f021

> Do not modify the current record directly for 
> OverwriteNonDefaultsWithLatestAvroPayload
> ---
>
> Key: HUDI-4545
> URL: https://issues.apache.org/jira/browse/HUDI-4545
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.12.0
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, we use short-cut logic:
> {code:java}
> a == b
> // for example: HoodieMergeHandle#writeUpdateRecord
> {code}
> to decide whether the update happened. In principle, we should not modify the 
> records read from disk directly; they should be kept immutable, and for any 
> changes we should return new records instead.
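The immutability rule described above can be illustrated with a minimal, self-contained sketch. Plain Maps stand in for Avro GenericRecords here; the actual fix builds a fresh record with Avro's GenericRecordBuilder instead of calling put() on currentRecord.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the "return a new record instead of mutating" rule.
// Plain Maps stand in for Avro GenericRecords (an illustrative simplification).
public class ImmutableMergeSketch {

    // Fields in the incoming record that equal their default (empty string here)
    // do not overwrite the current value; everything is written into a NEW map,
    // leaving both inputs untouched.
    public static Map<String, String> combine(Map<String, String> current,
                                              Map<String, String> incoming) {
        Map<String, String> merged = new HashMap<>(current); // never mutate inputs
        incoming.forEach((field, value) -> {
            if (value != null && !value.isEmpty()) { // overwriteField-style check
                merged.put(field, value);
            }
        });
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> current = new HashMap<>();
        current.put("id", "1");
        current.put("name", "alice");

        Map<String, String> incoming = new HashMap<>();
        incoming.put("name", ""); // default value: keep the current name
        incoming.put("city", "sf");

        Map<String, String> merged = combine(current, incoming);
        System.out.println(merged);                      // new, combined record
        System.out.println(current.containsKey("city")); // false: input untouched
    }
}
```

Returning a distinct object is what the added `assertNotSame` checks in the test diff verify.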





[jira] [Updated] (HUDI-4505) Returns instead of throws if lock file exists for FileSystemBasedLockProvider

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4505:
--
Priority: Blocker  (was: Major)

> Returns instead of throws if lock file exists for FileSystemBasedLockProvider
> -
>
> Key: HUDI-4505
> URL: https://issues.apache.org/jira/browse/HUDI-4505
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: image-2022-07-29-15-33-04-206.png
>
>
> To avoid the verbose log like below:
>  
> !image-2022-07-29-15-33-04-206.png|width=755,height=269!
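The proposed behavior change can be sketched using only the JDK. The class and method names below are assumptions for illustration, not the actual FileSystemBasedLockProvider API (which works against Hadoop FileSystem paths, not java.io.File).

```java
import java.io.File;
import java.io.IOException;

// Sketch of a file-based tryLock that quietly returns false when the lock file
// already exists, instead of throwing and producing a verbose stack trace in
// the logs. Names are illustrative, not Hudi's actual API.
public class FileLockSketch {

    public static boolean tryLock(File lockFile) {
        try {
            // createNewFile is atomic: it returns false if the file already
            // exists, so a contended lock becomes a plain `false` return.
            return lockFile.createNewFile();
        } catch (IOException e) {
            return false; // treat I/O failure as "lock not acquired"
        }
    }

    public static void unlock(File lockFile) {
        lockFile.delete();
    }

    public static void main(String[] args) throws IOException {
        File lock = File.createTempFile("hoodie", ".lock"); // file now exists
        System.out.println(tryLock(lock)); // false: already held, nothing thrown
        unlock(lock);
        System.out.println(tryLock(lock)); // true: acquired after release
        unlock(lock);
    }
}
```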





[jira] [Updated] (HUDI-4504) Disable metadata table by default for flink

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4504:
--
Priority: Blocker  (was: Major)

> Disable metadata table by default for flink
> ---
>
> Key: HUDI-4504
> URL: https://issues.apache.org/jira/browse/HUDI-4504
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>






[jira] [Resolved] (HUDI-4545) Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload

2022-08-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4545.
--

> Do not modify the current record directly for 
> OverwriteNonDefaultsWithLatestAvroPayload
> ---
>
> Key: HUDI-4545
> URL: https://issues.apache.org/jira/browse/HUDI-4545
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.12.0
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, we use short-cut logic:
> {code:java}
> a == b
> // for example: HoodieMergeHandle#writeUpdateRecord
> {code}
> to decide whether the update happened. In principle, we should not modify the 
> records read from disk directly; they should be kept immutable, and for any 
> changes we should return new records instead.





[GitHub] [hudi] danny0405 merged pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


danny0405 merged PR #6306:
URL: https://github.com/apache/hudi/pull/6306





[GitHub] [hudi] danny0405 commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


danny0405 commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1206083880

   The failed test should not be affected by this patch, and it succeeded in the 
last run: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10593&view=logs&s=859b8d9a-8fd6-5a5c-6f5e-f84f1990894e
   so I will just merge the PR.





[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1206072734

   
   ## CI report:
   
   * 04e513ba7885d107713277a0a7964c3a082d7405 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10598)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6311: [HUDI-4548] Unpack the column max/min to string instead of Utf8 for M…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6311:
URL: https://github.com/apache/hudi/pull/6311#issuecomment-1206068121

   
   ## CI report:
   
   * 17848d0f924115607c4144b3fa0a218333e89c99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10602)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6310: [HUDI-4474] Fix inferring props for meta sync

2022-08-04 Thread GitBox


hudi-bot commented on PR #6310:
URL: https://github.com/apache/hudi/pull/6310#issuecomment-1206068104

   
   ## CI report:
   
   * 366dc59d094ffcdd05ba7cdf905b85cb684a9fa7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10601)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6309: [HUDI-4547] Fix SortOperatorGen sort indices

2022-08-04 Thread GitBox


hudi-bot commented on PR #6309:
URL: https://github.com/apache/hudi/pull/6309#issuecomment-1206068086

   
   ## CI report:
   
   * f6df4432d24639619566565e3fac86cbd855ce9d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10600)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206068076

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   * 666088efaacc584a5f36db4df2f44f358e1ba53c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10599)
 
   
   





[GitHub] [hudi] codope commented on a diff in pull request #6227: [HUDI-4496] Fixing Orc support broken for Spark 3.x and more

2022-08-04 Thread GitBox


codope commented on code in PR #6227:
URL: https://github.com/apache/hudi/pull/6227#discussion_r938462895


##
hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala:
##
@@ -223,6 +215,20 @@ private[sql] class AvroSerializer(
 val numFields = st.length
 (getter, ordinal) => structConverter(getter.getStruct(ordinal, 
numFields))
 
+  

+  // Following section is amended to the original (Spark's) implementation
+  // >>> BEGINS
+  

+
+  case (st: StructType, UNION) =>

Review Comment:
   Very good point! Sounds good. Let's make sure the annotation is consistent 
across the code and easily searchable. Perhaps we should add this to the coding 
guidelines as well? 
https://hudi.apache.org/contribute/developer-setup#coding-guidelines






[GitHub] [hudi] YuweiXiao commented on pull request #6248: [HUDI-4303] Adding 4 to 5 upgrade handler to check for old deprecated "default" partition value

2022-08-04 Thread GitBox


YuweiXiao commented on PR #6248:
URL: https://github.com/apache/hudi/pull/6248#issuecomment-1206062611

   Hey @nsivabalan, just wondering why we are changing the default partition 
value. Is it only a new convention, or do other systems (like query engines) 
rely on it?
   
   Also, what if the user's partition is actually named `default`? It seems we 
cannot even verify that.





[GitHub] [hudi] hudi-bot commented on pull request #6311: [HUDI-4548] Unpack the column max/min to string instead of Utf8 for M…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6311:
URL: https://github.com/apache/hudi/pull/6311#issuecomment-1206065938

   
   ## CI report:
   
   * 17848d0f924115607c4144b3fa0a218333e89c99 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6310: [HUDI-4474] Fix inferring props for meta sync

2022-08-04 Thread GitBox


hudi-bot commented on PR #6310:
URL: https://github.com/apache/hudi/pull/6310#issuecomment-1206065914

   
   ## CI report:
   
   * 366dc59d094ffcdd05ba7cdf905b85cb684a9fa7 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6309: [HUDI-4547] Fix SortOperatorGen sort indices

2022-08-04 Thread GitBox


hudi-bot commented on PR #6309:
URL: https://github.com/apache/hudi/pull/6309#issuecomment-1206065893

   
   ## CI report:
   
   * f6df4432d24639619566565e3fac86cbd855ce9d UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206065867

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   * 666088efaacc584a5f36db4df2f44f358e1ba53c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206063538

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

2022-08-04 Thread GitBox


hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206063311

   
   ## CI report:
   
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Gatsby-Lee commented on issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure

2022-08-04 Thread GitBox


Gatsby-Lee commented on issue #6024:
URL: https://github.com/apache/hudi/issues/6024#issuecomment-1206062572

   @codope hi,
   
   First, as of 0.11.x, DELETE_PARTITION (in the AWS Glue Catalog) doesn't fail 
or raise an exception, unlike 0.10.x.
   Second, as you said, the actual delete is done lazily by the cleaner, but 
before the actual delete, Hudi seems to try to delete the metadata in the AWS Glue 
Catalog first.
   Third, org_id=5 has never existed.
   
   I will try to replicate the issue with 0.11.1 and post the output here.
   (I don't remember whether I reproduced this issue with 0.11.0 or not; anyway, I 
will try again.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4549) hive sync bundle causes class loader issue

2022-08-04 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-4549:


 Summary: hive sync bundle causes class loader issue
 Key: HUDI-4549
 URL: https://issues.apache.org/jira/browse/HUDI-4549
 Project: Apache Hudi
  Issue Type: Bug
  Components: dependencies
Reporter: Raymond Xu
 Fix For: 0.12.0


A weird classpath issue I found: when testing Deltastreamer using 
hudi-utilities-slim-bundle, if I put --jars 
hudi-hive-sync-bundle.jar,hudi-spark-bundle.jar, then I get this error when 
writing:

{code:java}
Caused by: java.lang.NoSuchMethodError: 
org.apache.hudi.avro.MercifulJsonConverter.convert(Ljava/lang/String;Lorg/apache/avro/Schema;)Lorg/apache/avro/generic/GenericRecord;
at 
org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJson(AvroConvertor.java:86)
at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
{code}

If I put the spark bundle before the hive sync bundle, there is no issue. Without 
hive-sync-bundle, there is also no issue. So hive-sync-bundle somehow messes up the 
classpath? It is not clear why it reports a hudi-common API as not found… caused by 
shading Avro?


I observed the same behavior with aws-bundle, which makes sense, as it is a 
superset of hive-sync-bundle.
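The jar-ordering behavior described above can be probed with a small sketch (the class name and probed resource are hypothetical stand-ins, not part of Hudi): the JVM loads whichever copy of a class appears first on the classpath, so listing every visible copy of a class resource shows which jar's copy wins and which copies are shadowed.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Enumeration;

public class ClasspathProbe {
    public static void main(String[] args) throws IOException {
        // Resource path of the class to probe; in the real case this would be
        // something like "org/apache/hudi/avro/MercifulJsonConverter.class".
        // Here we probe this class itself so the sketch is self-contained.
        String resource = ClasspathProbe.class.getName().replace('.', '/') + ".class";
        Enumeration<URL> copies =
            ClasspathProbe.class.getClassLoader().getResources(resource);
        // The first URL printed is the copy the class loader will actually
        // resolve; every later URL is shadowed by it.
        while (copies.hasMoreElements()) {
            System.out.println(copies.nextElement());
        }
    }
}
```

Running this with both bundles on --jars, in each order, would show which bundle's (possibly shaded) copy of the class is resolved first.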



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4548) Unpack the column max/min to string instead of Utf8 for Mor table

2022-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4548:
-
Labels: pull-request-available  (was: )

> Unpack the column max/min to string instead of Utf8 for Mor table
> -
>
> Key: HUDI-4548
> URL: https://issues.apache.org/jira/browse/HUDI-4548
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.12.0
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 opened a new pull request, #6311: [HUDI-4548] Unpack the column max/min to string instead of Utf8 for M…

2022-08-04 Thread GitBox


danny0405 opened a new pull request, #6311:
URL: https://github.com/apache/hudi/pull/6311

   …or table
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4548) Unpack the column max/min to string instead of Utf8 for Mor table

2022-08-04 Thread Danny Chen (Jira)
Danny Chen created HUDI-4548:


 Summary: Unpack the column max/min to string instead of Utf8 for 
Mor table
 Key: HUDI-4548
 URL: https://issues.apache.org/jira/browse/HUDI-4548
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Affects Versions: 0.12.0
Reporter: Danny Chen
 Fix For: 0.12.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan opened a new pull request, #6310: [HUDI-4474] Fix inferring props for meta sync

2022-08-04 Thread GitBox


xushiyan opened a new pull request, #6310:
URL: https://github.com/apache/hudi/pull/6310

   - `HoodieConfig#setDefaults` looks up declared fields, so the static class 
should be passed for reflection; otherwise, subclasses of HoodieSyncConfig won't 
set their defaults properly
   - Pass all write client configs of Deltastreamer to meta sync 
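The declared-fields pitfall the first bullet refers to can be illustrated with a minimal sketch (the config classes below are hypothetical stand-ins, not the real Hudi ones): `getDeclaredFields()` returns only the fields declared directly on the class it is called on, never inherited ones, so reflection-driven default-setting must be given the class that actually declares the fields.

```java
import java.lang.reflect.Field;

public class DeclaredFieldsDemo {
    // Hypothetical stand-ins for a base config class and a subclass.
    public static class BaseConfig {
        public static final String BASE_KEY = "base.key";
    }

    public static class SubConfig extends BaseConfig {
        public static final String SUB_KEY = "sub.key";
    }

    public static void main(String[] args) {
        // Called on the subclass, getDeclaredFields() sees only SUB_KEY;
        // the inherited BASE_KEY is invisible here.
        for (Field f : SubConfig.class.getDeclaredFields()) {
            System.out.println("SubConfig declares: " + f.getName());
        }
        // BASE_KEY is only visible on the class that declares it.
        for (Field f : BaseConfig.class.getDeclaredFields()) {
            System.out.println("BaseConfig declares: " + f.getName());
        }
    }
}
```

This is why default-setting logic that reflects over declared fields misses parent-class defaults when handed a subclass instead of the declaring static class.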


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4547) Partition sorting does not take effect when use bucket_insert.

2022-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4547:
-
Labels: pull-request-available  (was: )

> Partition sorting does not take effect when use bucket_insert.
> --
>
> Key: HUDI-4547
> URL: https://issues.apache.org/jira/browse/HUDI-4547
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: HunterHunter
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/hudi/issues/6301



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] LinMingQiang opened a new pull request, #6309: [HUDI-4547] fix Partition sorting does not take effect when use bucke…

2022-08-04 Thread GitBox


LinMingQiang opened a new pull request, #6309:
URL: https://github.com/apache/hudi/pull/6309

   …t_insert.
   
   Signed-off-by: HunterXHunter <1356469...@qq.com>
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   https://github.com/apache/hudi/issues/6301
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4385) Support to trigger the compaction in the flink batch mode.

2022-08-04 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4385:
-
Fix Version/s: 0.12.0
   (was: 0.13.0)

> Support to trigger the compaction in the flink batch mode. 
> ---
>
> Key: HUDI-4385
> URL: https://issues.apache.org/jira/browse/HUDI-4385
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: HunterHunter
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Configure the parameter `compaction.batch.mode.enabled` to decide whether to 
> enable offline `compaction`, so that users no longer need to perform `offline 
> compaction` separately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4348) merge into will cause data quality in concurrent scene

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4348:
--
Fix Version/s: 0.12.0

> merge into will cause data quality in concurrent scene
> --
>
> Key: HUDI-4348
> URL: https://issues.apache.org/jira/browse/HUDI-4348
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> A Hudi table with 15 billion records receives about 30 million updated records 
> every day; about 1000 records differ from the Hive table.
>  
> When I set `executor-cores 1` and `spark.task.cpus 1`, there is no problem, 
> but when the parallelism is over 1 in every executor, the data quality issue 
> appears.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4348) merge into will cause data quality in concurrent scene

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-4348.
-
Resolution: Fixed

> merge into will cause data quality in concurrent scene
> --
>
> Key: HUDI-4348
> URL: https://issues.apache.org/jira/browse/HUDI-4348
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> A Hudi table with 15 billion records receives about 30 million updated records 
> every day; about 1000 records differ from the Hive table.
>  
> When I set `executor-cores 1` and `spark.task.cpus 1`, there is no problem, 
> but when the parallelism is over 1 in every executor, the data quality issue 
> appears.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4217) improve repeat init object in ExpressionPayload

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-4217.
-
Resolution: Fixed

> improve repeat init object in ExpressionPayload
> ---
>
> Key: HUDI-4217
> URL: https://issues.apache.org/jira/browse/HUDI-4217
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
> Attachments: flamegraph4.svg, image-2022-06-10-10-07-45-715.png
>
>
> ExpressionPayload repeatedly initializes objects for the same schema, which 
> costs a lot of CPU time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4348) merge into will cause data quality in concurrent scene

2022-08-04 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4348:
--
Priority: Blocker  (was: Major)

> merge into will cause data quality in concurrent scene
> --
>
> Key: HUDI-4348
> URL: https://issues.apache.org/jira/browse/HUDI-4348
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: KnightChess
>Assignee: KnightChess
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> A Hudi table with 15 billion records receives about 30 million updated records 
> every day; about 1000 records differ from the Hive table.
>  
> When I set `executor-cores 1` and `spark.task.cpus 1`, there is no problem, 
> but when the parallelism is over 1 in every executor, the data quality issue 
> appears.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4541) Flink job fails with column stats enabled in metadata table due to NotSerializableException

2022-08-04 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575564#comment-17575564
 ] 

Danny Chen commented on HUDI-4541:
--

You can try per-job submission mode instead.

> Flink job fails with column stats enabled in metadata table due to 
> NotSerializableException

> 
>
> Key: HUDI-4541
> URL: https://issues.apache.org/jira/browse/HUDI-4541
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink-sql
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-08-04 at 17.10.05.png
>
>
> Environment: EMR 6.7.0 Flink 1.14.2
> Reproducible steps: Build Hudi Flink bundle from master
> {code:java}
> mvn clean package -DskipTests  -pl :hudi-flink1.14-bundle -am {code}
> Copy to EMR master node /lib/flink/lib
> Launch Flink SQL client:
> {code:java}
> cd /lib/flink && ./bin/yarn-session.sh --detached
> ./bin/sql-client.sh {code}
> Run the following from the Flink quick start guide with metadata table, 
> column stats, and data skipping enabled
> {code:java}
> CREATE TABLE t1(
>   uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 's3a://',
>   'table.type' = 'MERGE_ON_READ', -- this creates a MERGE_ON_READ table, by 
> default is COPY_ON_WRITE
>   'metadata.enabled' = 'true', -- enables multi-modal index and metadata table
>   'hoodie.metadata.index.column.stats.enable' = 'true', -- enables column 
> stats in metadata table
>   'read.data.skipping.enabled' = 'true' -- enables data skipping
> );
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); {code}
> !Screen Shot 2022-08-04 at 17.10.05.png|width=1130,height=463!
> Exception:
> {code:java}
> 2022-08-04 17:04:41
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
>     at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218)
>     at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679)
>     at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79)
>     at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444)
>     at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
>     at 
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
>     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
>     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.

[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1206034476

   
   ## CI report:
   
   * 137f2e09f90bc9f179f3c94c844be7be5e5f2325 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10593)
 
   * 04e513ba7885d107713277a0a7964c3a082d7405 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10598)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1206032240

   
   ## CI report:
   
   * 137f2e09f90bc9f179f3c94c844be7be5e5f2325 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10593)
 
   * 04e513ba7885d107713277a0a7964c3a082d7405 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-08-04 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206031953

   
   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486)
 
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10597)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4547) Partition sorting does not take effect when use bucket_insert.

2022-08-04 Thread HunterHunter (Jira)
HunterHunter created HUDI-4547:
--

 Summary: Partition sorting does not take effect when use 
bucket_insert.
 Key: HUDI-4547
 URL: https://issues.apache.org/jira/browse/HUDI-4547
 Project: Apache Hudi
  Issue Type: Bug
  Components: flink
Reporter: HunterHunter


https://github.com/apache/hudi/issues/6301



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] eric9204 opened a new issue, #6308: [SUPPORT] Spark multi writer failed,seems like clazz conflict ! ! !

2022-08-04 Thread GitBox


eric9204 opened a new issue, #6308:
URL: https://github.com/apache/hudi/issues/6308

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


hudi-bot commented on PR #6267:
URL: https://github.com/apache/hudi/pull/6267#issuecomment-1206029678

   
   ## CI report:
   
   * 43899bb9bf0456c877213ca8bf8641d8258d6903 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10594)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6246: [HUDI-4543] able to disable precombine field when table schema contains a field named ts

2022-08-04 Thread GitBox


hudi-bot commented on PR #6246:
URL: https://github.com/apache/hudi/pull/6246#issuecomment-1206029611

   
   ## CI report:
   
   * 7b04e73fecb574e199a3aad9e74dd6c9ae45d123 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10592)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1206029749

   
   ## CI report:
   
   * 137f2e09f90bc9f179f3c94c844be7be5e5f2325 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10593)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

2022-08-04 Thread GitBox


hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206029481

   
   ## CI report:
   
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541)
 
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10596)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-08-04 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1206029376

   
   ## CI report:
   
   * dfd50cd0007c4ff48b3e0e27c368d573e47560a2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10486)
 
   * 5a6ac9622379715e890f1ec1cd7be9422febeb5c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiaozhch5 closed issue #6301: [SUPPORT] Flink uses bulk_insert mode to load the data from hdfs file to hudi very slow.

2022-08-04 Thread GitBox


xiaozhch5 closed issue #6301: [SUPPORT] Flink uses bulk_insert mode to load the 
data from hdfs file to hudi very slow.
URL: https://github.com/apache/hudi/issues/6301





[GitHub] [hudi] hudi-bot commented on pull request #6141: [HUDI-3189] Fallback to full table scan with incremental query when files are cleaned up or achived for MOR table

2022-08-04 Thread GitBox


hudi-bot commented on PR #6141:
URL: https://github.com/apache/hudi/pull/6141#issuecomment-1206007248

   
   ## CI report:
   
   * 23f96b3ecc8812ffae7f9e692e883cdabba03eb0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10541)
 
   * 2a493fcafb42e21cbfcae3787ab30853319f4bf3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yuzhaojing commented on issue #6126: [SUPPORT] Hudi Table(MOR) not getting created from Flink Sql Client Shell

2022-08-04 Thread GitBox


yuzhaojing commented on issue #6126:
URL: https://github.com/apache/hudi/issues/6126#issuecomment-1206006030

   Sure, I will try to reproduce it.





[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206003270

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10595)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-08-04 Thread GitBox


wzx140 commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r938412210


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/util/HoodieSparkRecordUtils.java:
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.hudi.HoodieInternalRowUtils;
+import org.apache.hudi.commmon.model.HoodieSparkRecord;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieOperation;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.keygen.RowKeyGeneratorHelper;
+
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+
+import scala.Tuple2;
+
+public class HoodieSparkRecordUtils {
+
+  /**
+   * Utility method to convert an InternalRow to HoodieRecord using schema and payload class.
+   */
+  public static HoodieRecord<InternalRow> convertToHoodieSparkRecord(InternalRow data, StructType structType) {
+    return new HoodieSparkRecord(data, structType);
+  }
+
+  /**
+   * Utility method to convert InternalRow to HoodieRecord using schema and payload class.
+   */
+  public static HoodieRecord<InternalRow> convertToHoodieSparkRecord(StructType structType, InternalRow data, String preCombineField, boolean withOperationField) {
+    return convertToHoodieSparkRecord(structType, data, preCombineField,
+        Pair.of(HoodieRecord.RECORD_KEY_METADATA_FIELD, HoodieRecord.PARTITION_PATH_METADATA_FIELD),
+        withOperationField, Option.empty());
+  }
+
+  public static HoodieRecord<InternalRow> convertToHoodieSparkRecord(StructType structType, InternalRow data, String preCombineField, boolean withOperationField,
+      Option<String> partitionName) {
+    return convertToHoodieSparkRecord(structType, data, preCombineField,
+        Pair.of(HoodieRecord.RECORD_KEY_METADATA_FIELD, HoodieRecord.PARTITION_PATH_METADATA_FIELD),
+        withOperationField, partitionName);
+  }
+
+  /**
+   * Utility method to convert InternalRow to HoodieRecord using schema and payload class.
+   */
+  public static HoodieRecord<InternalRow> convertToHoodieSparkRecord(StructType structType, InternalRow data, String preCombineField, Pair<String, String> recordKeyPartitionPathFieldPair,
+      boolean withOperationField, Option<String> partitionName) {
+    final String recKey = getValue(structType, recordKeyPartitionPathFieldPair.getKey(), data).toString();
+    final String partitionPath = (partitionName.isPresent() ? partitionName.get() :
+        getValue(structType, recordKeyPartitionPathFieldPair.getRight(), data).toString());
+
+    Object preCombineVal = getPreCombineVal(structType, data, preCombineField);

Review Comment:
   @alexeykudinkin Thank you for your advice, I will study it carefully. This 
sounds reasonable.






[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-08-04 Thread GitBox


wzx140 commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r938408127


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordMerger.java:
##
@@ -30,9 +34,19 @@
  * It can implement the merging logic of HoodieRecord of different engines
  * and avoid the performance consumption caused by the serialization/deserialization of Avro payload.
  */
-public interface HoodieMerge extends Serializable {
-
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
+@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
+public interface HoodieRecordMerger extends Serializable {
+
+  /**
+   * This method converges combineAndGetUpdateValue and precombine from HoodiePayload.
+   * It'd be an associative operation: f(a, f(b, c)) = f(f(a, b), c) (which we can translate as having 3 versions A, B, C
+   * of the single record, both orders of operation application have to yield the same result)
+   */
+  Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema schema, Properties props) throws IOException;

Review Comment:
   @alexeykudinkin Maybe vc means the merge function will deduplicate the records 
for insertion. Do you think we should put the `shouldCombine` marker in the 
record? cc @vinothchandar 
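The associativity requirement quoted in the javadoc above can be illustrated with a standalone sketch. Note this uses simplified stand-in types (`Rec`, a local `merge`), not the actual Hudi `HoodieRecord`/`HoodieRecordMerger` interfaces: a "latest ordering value wins" merge is associative because `max()` is associative.

```java
import java.util.Optional;

// Illustrative stand-ins only; not the actual Hudi HoodieRecord/HoodieRecordMerger types.
public class MergeAssociativityDemo {
    record Rec(String key, long orderingVal) {}

    // "Latest ordering value wins" merge; associative because max() is associative.
    static Optional<Rec> merge(Rec older, Rec newer) {
        return Optional.of(newer.orderingVal() >= older.orderingVal() ? newer : older);
    }

    public static void main(String[] args) {
        Rec a = new Rec("k", 1);
        Rec b = new Rec("k", 3);
        Rec c = new Rec("k", 2);
        // f(f(a, b), c) must equal f(a, f(b, c)) for any 3 versions of a record
        Rec left = merge(merge(a, b).get(), c).get();
        Rec right = merge(a, merge(b, c).get()).get();
        System.out.println(left.equals(right)); // prints "true"
    }
}
```

A merger that is not associative (e.g. one that depends on arrival order rather than an ordering field) could produce different results depending on how log blocks are grouped during compaction.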






[GitHub] [hudi] hudi-bot commented on pull request #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


hudi-bot commented on PR #6307:
URL: https://github.com/apache/hudi/pull/6307#issuecomment-1206001322

   
   ## CI report:
   
   * 5e75dee8c56cb14110b33548c09aad222adc57d2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope commented on issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure

2022-08-04 Thread GitBox


codope commented on issue #6024:
URL: https://github.com/apache/hudi/issues/6024#issuecomment-1205997925

   Btw, `org_id=5_\$folder$` may be an S3 thing. Did the partition 
`org_id=5` ever exist before? 





[GitHub] [hudi] trushev commented on a diff in pull request #6276: [HUDI-4523] Sequential submitting of flink jobs leads to java.net.ConnectException

2022-08-04 Thread GitBox


trushev commented on code in PR #6276:
URL: https://github.com/apache/hudi/pull/6276#discussion_r938401579


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java:
##
@@ -124,10 +124,8 @@ public FileSystemViewManager getViewManager() {
     return viewManager;
   }
 
-  public boolean canReuseFor(String basePath) {
-    return this.server != null
-        && this.viewManager != null
-        && this.basePath.equals(basePath);
+  public boolean canReuse() {
+    return this.server != null && this.viewManager != null;
   }

Review Comment:
   I've checked the same timeline service for different `basePath` values. It works in my 
test. Do you think we need to reuse the same timeline service for different 
paths?
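Once the `basePath` comparison is dropped, the reuse condition in the diff above reduces to a liveness check on the service. A minimal stand-in sketch (hypothetical holder class, not the actual Hudi `EmbeddedTimelineService`):

```java
// Simplified sketch of the reuse check after the change above
// (hypothetical stand-in class, not the actual Hudi EmbeddedTimelineService).
public class TimelineServiceHolder {
    private final Object server;       // stands in for the embedded timeline server
    private final Object viewManager;  // stands in for the FileSystemViewManager

    public TimelineServiceHolder(Object server, Object viewManager) {
        this.server = server;
        this.viewManager = viewManager;
    }

    // Reusable whenever the service is alive, regardless of which
    // table (basePath) originally started it.
    public boolean canReuse() {
        return server != null && viewManager != null;
    }

    public static void main(String[] args) {
        System.out.println(new TimelineServiceHolder(new Object(), new Object()).canReuse()); // prints "true"
        System.out.println(new TimelineServiceHolder(null, null).canReuse());                 // prints "false"
    }
}
```

The trade-off under discussion: sharing one service across tables saves a port/thread per job, at the cost of the service having to serve file-system views for multiple base paths.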






[jira] [Created] (HUDI-4546) Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread leesf (Jira)
leesf created HUDI-4546:
---

 Summary: Optimize catalog cast logic in HoodieSpark3Analysis
 Key: HUDI-4546
 URL: https://issues.apache.org/jira/browse/HUDI-4546
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: leesf
Assignee: leesf


In HoodieSpark3Analysis, if it is CreateV2Table, there is no need to cast the 
HoodieCatalog since CreateV2Table contains TableCatalog and we would use it 
directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4485) Hudi cli got empty result for command show fsview all

2022-08-04 Thread Yao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Zhang updated HUDI-4485:

Attachment: spring-shell-1.2.0.RELEASE.jar

> Hudi cli got empty result for command show fsview all
> -
>
> Key: HUDI-4485
> URL: https://issues.apache.org/jira/browse/HUDI-4485
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.11.1
> Environment: Hudi version : 0.11.1
> Spark version : 3.1.1
> Hive version : 3.1.0
> Hadoop version : 3.1.1
>Reporter: Yao Zhang
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
> fsview all · Issue #6177 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/6177]
> **Describe the problem you faced**
> Hudi cli got empty result after running command show fsview all.
> ![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)
> The type of table t1 is  COW and I am sure that the parquet file is actually 
> generated inside data folder. Also, the parquet files are not damaged as the 
> data could be retrieved correctly by reading as Hudi table or directly 
> reading each parquet file(using Spark).
> **To Reproduce**
> Steps to reproduce the behavior:
> 1. Enter Flink SQL client.
> 2. Execute the SQL and check the data was written successfully.
> ```sql
> CREATE TABLE t1(
>   uuid VARCHAR(20),
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///path/to/table/',
>   'table.type' = 'COPY_ON_WRITE'
> );
> -- insert data using values
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter Hudi cli and execute `show fsview all`
> **Expected behavior**
> `show fsview all` in Hudi cli should return all file slices.
> **Environment Description**
> * Hudi version : 0.11.1
> * Spark version : 3.1.1
> * Hive version : 3.1.0
> * Hadoop version : 3.1.1
> * Storage (HDFS/S3/GCS..) : HDFS
> * Running on Docker? (yes/no) : no
> **Additional context**
> No.
> **Stacktrace**
> N/A
>  





[jira] [Updated] (HUDI-4546) Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4546:
-
Labels: pull-request-available  (was: )

> Optimize catalog cast logic in HoodieSpark3Analysis
> ---
>
> Key: HUDI-4546
> URL: https://issues.apache.org/jira/browse/HUDI-4546
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
>
> In HoodieSpark3Analysis, if it is CreateV2Table, there is no need to cast the 
> HoodieCatalog since CreateV2Table contains TableCatalog and we would use it 
> directly.





[GitHub] [hudi] leesf opened a new pull request, #6307: [HUDI-4546] Optimize catalog cast logic in HoodieSpark3Analysis

2022-08-04 Thread GitBox


leesf opened a new pull request, #6307:
URL: https://github.com/apache/hudi/pull/6307

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   





[jira] [Updated] (HUDI-4485) Hudi cli got empty result for command show fsview all

2022-08-04 Thread Yao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Zhang updated HUDI-4485:

Description: 
This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
fsview all · Issue #6177 · apache/hudi 
(github.com)|https://github.com/apache/hudi/issues/6177]

**Describe the problem you faced**

Hudi cli got empty result after running command show fsview all.

![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)

The type of table t1 is COW and I am sure that the parquet file is actually 
generated inside data folder. Also, the parquet files are not damaged as the 
data could be retrieved correctly by reading as Hudi table or directly reading 
each parquet file(using Spark).

**To Reproduce**

Steps to reproduce the behavior:

1. Enter Flink SQL client.
2. Execute the SQL and check the data was written successfully.
```sql
CREATE TABLE t1(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs:///path/to/table/',
'table.type' = 'COPY_ON_WRITE'
);

-- insert data using values
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```
3. Enter Hudi cli and execute `show fsview all`

**Expected behavior**

`show fsview all` in Hudi cli should return all file slices.

**Environment Description**
 * Hudi version : 0.11.1

 * Spark version : 3.1.1

 * Hive version : 3.1.0

 * Hadoop version : 3.1.1

 * Storage (HDFS/S3/GCS..) : HDFS

 * Running on Docker? (yes/no) : no

**Additional context**

No.

**Stacktrace**

N/A

 

Temporary solution:

I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the 
attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.

  was:
This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
fsview all · Issue #6177 · apache/hudi 
(github.com)|https://github.com/apache/hudi/issues/6177]

**Describe the problem you faced**

Hudi cli got empty result after running command show fsview all.

![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)

The type of table t1 is COW and I am sure that the parquet file is actually 
generated inside data folder. Also, the parquet files are not damaged as the 
data could be retrieved correctly by reading as Hudi table or directly reading 
each parquet file(using Spark).

**To Reproduce**

Steps to reproduce the behavior:

1. Enter Flink SQL client.
2. Execute the SQL and check the data was written successfully.
```sql
CREATE TABLE t1(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs:///path/to/table/',
'table.type' = 'COPY_ON_WRITE'
);

-- insert data using values
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```
3. Enter Hudi cli and execute `show fsview all`

**Expected behavior**

`show fsview all` in Hudi cli should return all file slices.

**Environment Description**
 * Hudi version : 0.11.1

 * Spark version : 3.1.1

 * Hive version : 3.1.0

 * Hadoop version : 3.1.1

 * Storage (HDFS/S3/GCS..) : HDFS

 * Running on Docker? (yes/no) : no

**Additional context**

No.

**Stacktrace**

N/A

 

Temporary solution:

I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the 
attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.


> Hudi cli got empty result for command show fsview all
> -
>
> Key: HUDI-4485
> URL: https://issues.apache.org/jira/browse/HUDI-4485
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.11.1
> Environment: Hudi version : 0.11.1
> Spark version : 3.1.1
> Hive version : 3.1.0
> Hadoop version :

[jira] [Updated] (HUDI-4485) Hudi cli got empty result for command show fsview all

2022-08-04 Thread Yao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Zhang updated HUDI-4485:

Description: 
This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
fsview all · Issue #6177 · apache/hudi 
(github.com)|https://github.com/apache/hudi/issues/6177]

**Describe the problem you faced**

Hudi cli got empty result after running command show fsview all.

![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)

The type of table t1 is COW and I am sure that the parquet file is actually 
generated inside data folder. Also, the parquet files are not damaged as the 
data could be retrieved correctly by reading as Hudi table or directly reading 
each parquet file(using Spark).

**To Reproduce**

Steps to reproduce the behavior:

1. Enter Flink SQL client.
2. Execute the SQL and check the data was written successfully.
```sql
CREATE TABLE t1(
uuid VARCHAR(20),
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs:///path/to/table/',
'table.type' = 'COPY_ON_WRITE'
);

-- insert data using values
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```
3. Enter Hudi cli and execute `show fsview all`

**Expected behavior**

`show fsview all` in Hudi cli should return all file slices.

**Environment Description**
 * Hudi version : 0.11.1

 * Spark version : 3.1.1

 * Hive version : 3.1.0

 * Hadoop version : 3.1.1

 * Storage (HDFS/S3/GCS..) : HDFS

 * Running on Docker? (yes/no) : no

**Additional context**

No.

**Stacktrace**

N/A

 

Temporary solution:

I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the 
attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.

  was:
This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
fsview all · Issue #6177 · apache/hudi 
(github.com)|https://github.com/apache/hudi/issues/6177]

**Describe the problem you faced**

Hudi cli got empty result after running command show fsview all.

![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)

The type of table t1 is  COW and I am sure that the parquet file is actually 
generated inside data folder. Also, the parquet files are not damaged as the 
data could be retrieved correctly by reading as Hudi table or directly reading 
each parquet file(using Spark).

**To Reproduce**

Steps to reproduce the behavior:

1. Enter Flink SQL client.
2. Execute the SQL and check the data was written successfully.
```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///path/to/table/',
  'table.type' = 'COPY_ON_WRITE'
);

-- insert data using values
INSERT INTO t1 VALUES
  ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```
3. Enter Hudi cli and execute `show fsview all`

**Expected behavior**

`show fsview all` in Hudi cli should return all file slices.

**Environment Description**

* Hudi version : 0.11.1

* Spark version : 3.1.1

* Hive version : 3.1.0

* Hadoop version : 3.1.1

* Storage (HDFS/S3/GCS..) : HDFS

* Running on Docker? (yes/no) : no


**Additional context**

No.

**Stacktrace**

N/A


 


> Hudi cli got empty result for command show fsview all
> -
>
> Key: HUDI-4485
> URL: https://issues.apache.org/jira/browse/HUDI-4485
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.11.1
> Environment: Hudi version : 0.11.1
> Spark version : 3.1.1
> Hive version : 3.1.0
> Hadoop version : 3.1.1
>Reporter: Yao Zhang
>Priority: Minor
> Fix For: 0.13.0
>
> Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empt

[GitHub] [hudi] hudi-bot commented on pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


hudi-bot commented on PR #6267:
URL: https://github.com/apache/hudi/pull/6267#issuecomment-1205974002

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 43899bb9bf0456c877213ca8bf8641d8258d6903 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10594)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] flashJd commented on pull request #6246: [HUDI-4543] able to disable precombine field when table schema contains a field named ts

2022-08-04 Thread GitBox


flashJd commented on PR #6246:
URL: https://github.com/apache/hudi/pull/6246#issuecomment-1205972641

   > 
   I've filed a Jira ticket and changed the commit title, also fixed the 
checkstyle conflicts; can you help approve the workflow? Thanks.
   
   





[GitHub] [hudi] flashJd commented on pull request #6246: [HUDI-4543] able to disable precombine field when table schema contains a field named ts

2022-08-04 Thread GitBox


flashJd commented on PR #6246:
URL: https://github.com/apache/hudi/pull/6246#issuecomment-1205972885

   > Thanks for the contribution, can we log a JIRA issue and change the commit 
title to a form like: `[HUDI-${JIRA issue ID}] ${your actual commit title}`.
   
   I've filed a Jira ticket and changed the commit title, also fixed the 
checkstyle conflicts; can you help approve the workflow? Thanks.





[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1205971247

   
   ## CI report:
   
   * 137f2e09f90bc9f179f3c94c844be7be5e5f2325 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10593)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


hudi-bot commented on PR #6267:
URL: https://github.com/apache/hudi/pull/6267#issuecomment-1205971169

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 43899bb9bf0456c877213ca8bf8641d8258d6903 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] Zouxxyy commented on pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


Zouxxyy commented on PR #6267:
URL: https://github.com/apache/hudi/pull/6267#issuecomment-1205969782

   @hudi-bot run azure





[GitHub] [hudi] Zouxxyy commented on a diff in pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


Zouxxyy commented on code in PR #6267:
URL: https://github.com/apache/hudi/pull/6267#discussion_r938389569


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -248,17 +248,17 @@ private Pair> 
getFilesToCleanKeepingLatestVersions(
 
   while (fileSliceIterator.hasNext() && keepVersions > 0) {
 // Skip this most recent version
+fileSliceIterator.next();
+keepVersions--;
+  }
+  // Delete the remaining files
+  while (fileSliceIterator.hasNext()) {

Review Comment:
   I added a test case, hope it helps
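The fix under review can be sketched as two passes over the file-slice iterator: first skip the most recent `keepVersions` slices, then mark everything that remains for deletion. The following is an illustrative Python sketch under assumed inputs (a newest-first list of version labels), not Hudi's actual Java `CleanPlanner` implementation:

```python
def files_to_clean_keeping_latest_versions(file_slices, keep_versions):
    """Sketch of the corrected loop: file_slices is ordered newest-first;
    the first keep_versions slices are retained, the rest are returned
    as candidates for deletion."""
    it = iter(file_slices)
    # Skip the most recent versions that must be kept
    while keep_versions > 0:
        if next(it, None) is None:
            return []  # fewer slices than versions to keep: nothing to clean
        keep_versions -= 1
    # Delete the remaining (older) files
    return list(it)
```

For example, with five versions and `keep_versions = 2`, only the three oldest are selected for cleaning.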






[GitHub] [hudi] YannByron commented on pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-04 Thread GitBox


YannByron commented on PR #6267:
URL: https://github.com/apache/hudi/pull/6267#issuecomment-1205968918

   @nsivabalan please trigger the workflows.





[GitHub] [hudi] hudi-bot commented on pull request #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


hudi-bot commented on PR #6306:
URL: https://github.com/apache/hudi/pull/6306#issuecomment-1205968594

   
   ## CI report:
   
   * 137f2e09f90bc9f179f3c94c844be7be5e5f2325 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6246: [HUDI-4543] able to disable precombine field when table schema contains a field named ts

2022-08-04 Thread GitBox


hudi-bot commented on PR #6246:
URL: https://github.com/apache/hudi/pull/6246#issuecomment-1205962815

   
   ## CI report:
   
   * 39773f8cf8f7a8441963080fd43d8ea04f1a74c9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10586)
 
   * 7b04e73fecb574e199a3aad9e74dd6c9ae45d123 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4543) can't enable the proc_time/natural order sequence semantics when a ts field exists in the table schema

2022-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4543:
-
Labels: pull-request-available  (was: )

> can't enable the proc_time/natural order sequence semantics when a ts field 
> exists in the table schema
> --
>
> Key: HUDI-4543
> URL: https://issues.apache.org/jira/browse/HUDI-4543
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: yonghua jian
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4545) Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload

2022-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4545:
-
Labels: pull-request-available  (was: )

> Do not modify the current record directly for 
> OverwriteNonDefaultsWithLatestAvroPayload
> ---
>
> Key: HUDI-4545
> URL: https://issues.apache.org/jira/browse/HUDI-4545
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.12.0
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> Currently, we use short-cut logic:
> {code:java}
> a == b
> // for example: HoodieMergeHandle#writeUpdateRecord
> {code}
> to decide whether the update happens, in principle, we should not modify the 
> records from disk directly, they should be kept as immutable, for any 
> changes, we should return new records instead.





[GitHub] [hudi] hudi-bot commented on pull request #6246: [HUDI-4543] able to disable precombine field when table schema contains a field named ts

2022-08-04 Thread GitBox


hudi-bot commented on PR #6246:
URL: https://github.com/apache/hudi/pull/6246#issuecomment-1205965800

   
   ## CI report:
   
   * 39773f8cf8f7a8441963080fd43d8ea04f1a74c9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10586)
 
   * 7b04e73fecb574e199a3aad9e74dd6c9ae45d123 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10592)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 opened a new pull request, #6306: [HUDI-4545] Do not modify the current record directly for OverwriteNo…

2022-08-04 Thread GitBox


danny0405 opened a new pull request, #6306:
URL: https://github.com/apache/hudi/pull/6306

   …nDefaultsWithLatestAvroPayload
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   





[jira] [Created] (HUDI-4545) Do not modify the current record directly for OverwriteNonDefaultsWithLatestAvroPayload

2022-08-04 Thread Danny Chen (Jira)
Danny Chen created HUDI-4545:


 Summary: Do not modify the current record directly for 
OverwriteNonDefaultsWithLatestAvroPayload
 Key: HUDI-4545
 URL: https://issues.apache.org/jira/browse/HUDI-4545
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Affects Versions: 0.12.0
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.12.0


Currently, we use short-cut logic:
{code:java}
a == b
// for example: HoodieMergeHandle#writeUpdateRecord
{code}

to decide whether the update happens, in principle, we should not modify the 
records from disk directly, they should be kept as immutable, for any changes, 
we should return new records instead.
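The immutability principle described above can be illustrated with a small sketch. This is a hypothetical field-level merge in Python, not the actual `OverwriteNonDefaultsWithLatestAvroPayload` Avro code: instead of writing incoming non-default fields into the current record, it builds a new merged record and returns the current record unchanged when nothing differs, so an identity check (the `a == b` shortcut) remains a valid signal that an update happened.

```python
def overwrite_non_defaults(current, incoming, defaults):
    # Build a new record rather than mutating `current` in place; records
    # read from disk stay immutable, and identity (`merged is current`)
    # still tells the caller whether an update actually happened.
    merged = dict(current)
    changed = False
    for field, value in incoming.items():
        if value != defaults.get(field) and merged.get(field) != value:
            merged[field] = value
            changed = True
    return merged if changed else current
```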





[hudi] 01/02: update dynamodb lockk provider docs to include iam and additional dependencies

2022-08-04 Thread wenningd
This is an automated email from the ASF dual-hosted git repository.

wenningd pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit c47f323765b1abb5196b16d0224b5baa0940bc7e
Author: atharvai 
AuthorDate: Thu Jul 21 11:11:09 2022 +0100

update dynamodb lockk provider docs to include iam and additional 
dependencies
---
 website/docs/concurrency_control.md | 33 +
 website/docs/configurations.md  |  2 +-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/website/docs/concurrency_control.md 
b/website/docs/concurrency_control.md
index e71cb4a8f2..689b7632f7 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -78,7 +78,10 @@ 
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLoc
 hoodie.write.lock.dynamodb.table
 hoodie.write.lock.dynamodb.partition_key
 hoodie.write.lock.dynamodb.region
+hoodie.write.lock.dynamodb.endpoint_url
+hoodie.write.lock.dynamodb.billing_mode
 ```
+
 Also, to set up the credentials for accessing AWS resources, customers can 
pass the following props to Hudi jobs:
 ```
 hoodie.aws.access.key
@@ -87,6 +90,36 @@ hoodie.aws.session.token
 ```
 If not configured, Hudi falls back to use 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
+
+IAM policy for your service instance will need to add the following 
permissions:
+
+```json
+{
+  "Sid":"DynamoDBLocksTable",
+  "Effect": "Allow",
+  "Action": [
+"dynamodb:CreateTable",
+"dynamodb:DeleteItem",
+"dynamodb:DescribeTable",
+"dynamodb:GetItem",
+"dynamodb:PutItem",
+"dynamodb:Scan",
+"dynamodb:UpdateItem"
+  ],
+  "Resource": 
"arn:${Partition}:dynamodb:${Region}:${Account}:table/${TableName}"
+}
+```
+- `TableName` : same as `hoodie.write.lock.dynamodb.table`
+- `Region`: same as `hoodie.write.lock.dynamodb.region`
+
+AWS SDK dependencies are not bundled with Hudi from v0.10.x and will need to 
be added to your classpath. 
+Add the following Maven packages (check the latest versions at time of 
install):
+```
+com.amazonaws:dynamodb-lock-client
+com.amazonaws:aws-java-sdk-dynamodb
+com.amazonaws:aws-java-sdk-core
+```
+
 ## Datasource Writer
 
 The `hudi-spark` module offers the DataSource API to write (and read) a Spark 
DataFrame into a Hudi table.
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index b92e40f06c..6dcb76179b 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -1696,7 +1696,7 @@ Configs that control DynamoDB based locking mechanisms 
required for concurrency
 
 `Config Class`: org.apache.hudi.config.DynamoDbBasedLockConfig
 >  hoodie.write.lock.dynamodb.billing_mode
-> For DynamoDB based lock provider, by default it is PAY_PER_REQUEST 
mode
+> For DynamoDB based lock provider, by default it is PAY_PER_REQUEST mode. 
Alternative is PROVISIONED
 > **Default Value**: PAY_PER_REQUEST (Optional)
 > `Config Param: DYNAMODB_LOCK_BILLING_MODE`
 > `Since Version: 0.10.0`



[hudi] 02/02: update versioned docs for dynamodb lock provider docs to include iam and additional dependencies

2022-08-04 Thread wenningd
This is an automated email from the ASF dual-hosted git repository.

wenningd pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit d89c0ce798f3842ab325ad2c7b879e554cf18214
Author: atharvai 
AuthorDate: Thu Jul 21 11:18:29 2022 +0100

update versioned docs for dynamodb lock provider docs to include iam and 
additional dependencies
---
 website/docs/concurrency_control.md|  1 -
 .../version-0.10.0/concurrency_control.md  | 32 +
 .../version-0.10.1/concurrency_control.md  | 33 +-
 .../version-0.11.0/concurrency_control.md  | 33 +-
 .../version-0.11.1/concurrency_control.md  | 33 +-
 5 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/website/docs/concurrency_control.md 
b/website/docs/concurrency_control.md
index 689b7632f7..25a523ee7c 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -90,7 +90,6 @@ hoodie.aws.session.token
 ```
 If not configured, Hudi falls back to use 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
-
 IAM policy for your service instance will need to add the following 
permissions:
 
 ```json
diff --git a/website/versioned_docs/version-0.10.0/concurrency_control.md 
b/website/versioned_docs/version-0.10.0/concurrency_control.md
index a9a0d5860c..fe38f102cd 100644
--- a/website/versioned_docs/version-0.10.0/concurrency_control.md
+++ b/website/versioned_docs/version-0.10.0/concurrency_control.md
@@ -78,7 +78,10 @@ 
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLoc
 hoodie.write.lock.dynamodb.table
 hoodie.write.lock.dynamodb.partition_key
 hoodie.write.lock.dynamodb.region
+hoodie.write.lock.dynamodb.endpoint_url
+hoodie.write.lock.dynamodb.billing_mode
 ```
+
 Also, to set up the credentials for accessing AWS resources, customers can 
pass the following props to Hudi jobs:
 ```
 hoodie.aws.access.key
@@ -87,6 +90,35 @@ hoodie.aws.session.token
 ```
 If not configured, Hudi falls back to use 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
+IAM policy for your service instance will need to add the following 
permissions:
+
+```json
+{
+  "Sid":"DynamoDBLocksTable",
+  "Effect": "Allow",
+  "Action": [
+"dynamodb:CreateTable",
+"dynamodb:DeleteItem",
+"dynamodb:DescribeTable",
+"dynamodb:GetItem",
+"dynamodb:PutItem",
+"dynamodb:Scan",
+"dynamodb:UpdateItem"
+  ],
+  "Resource": 
"arn:${Partition}:dynamodb:${Region}:${Account}:table/${TableName}"
+}
+```
+- `TableName` : same as `hoodie.write.lock.dynamodb.table`
+- `Region`: same as `hoodie.write.lock.dynamodb.region`
+
+AWS SDK dependencies are not bundled with Hudi from v0.10.x and will need to 
be added to your classpath.
+Add the following Maven packages (check the latest versions at time of 
install):
+```
+com.amazonaws:dynamodb-lock-client
+com.amazonaws:aws-java-sdk-dynamodb
+com.amazonaws:aws-java-sdk-core
+```
+
 ## Datasource Writer
 
 The `hudi-spark` module offers the DataSource API to write (and read) a Spark 
DataFrame into a Hudi table.
diff --git a/website/versioned_docs/version-0.10.1/concurrency_control.md 
b/website/versioned_docs/version-0.10.1/concurrency_control.md
index a9a0d5860c..6377c762bd 100644
--- a/website/versioned_docs/version-0.10.1/concurrency_control.md
+++ b/website/versioned_docs/version-0.10.1/concurrency_control.md
@@ -70,7 +70,6 @@ hoodie.write.lock.hivemetastore.table
 `The HiveMetastore URI's are picked up from the hadoop configuration file 
loaded during runtime.`
 
 **`Amazon DynamoDB`** based lock provider
-
 Amazon DynamoDB based lock provides a simple way to support multi writing 
across different clusters
 
 ```
@@ -78,7 +77,10 @@ 
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLoc
 hoodie.write.lock.dynamodb.table
 hoodie.write.lock.dynamodb.partition_key
 hoodie.write.lock.dynamodb.region
+hoodie.write.lock.dynamodb.endpoint_url
+hoodie.write.lock.dynamodb.billing_mode
 ```
+
 Also, to set up the credentials for accessing AWS resources, customers can 
pass the following props to Hudi jobs:
 ```
 hoodie.aws.access.key
@@ -87,6 +89,35 @@ hoodie.aws.session.token
 ```
 If not configured, Hudi falls back to use 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
+IAM policy for your service instance will need to add the following 
permissions:
+
+```json
+{
+  "Sid":"DynamoDBLocksTable",
+  "Effect": "Allow",
+  "Action": [
+"dynamodb:CreateTable",
+"dynamodb:DeleteItem",
+"dynamodb:DescribeTable",
+"dynamodb:GetItem",
+"dynamodb:PutIte

[hudi] branch asf-site updated (5a69d734b6 -> d89c0ce798)

2022-08-04 Thread wenningd
This is an automated email from the ASF dual-hosted git repository.

wenningd pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 5a69d734b6 GitHub Actions build asf-site
 new c47f323765 update dynamodb lockk provider docs to include iam and 
additional dependencies
 new d89c0ce798 update versioned docs for dynamodb lock provider docs to 
include iam and additional dependencies

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 website/docs/concurrency_control.md| 32 +
 website/docs/configurations.md |  2 +-
 .../version-0.10.0/concurrency_control.md  | 32 +
 .../version-0.10.1/concurrency_control.md  | 33 +-
 .../version-0.11.0/concurrency_control.md  | 33 +-
 .../version-0.11.1/concurrency_control.md  | 33 +-
 6 files changed, 161 insertions(+), 4 deletions(-)



[GitHub] [hudi] zhedoubushishi merged pull request #6168: [DOCS] Update aws dynamodb lock provider docs

2022-08-04 Thread GitBox


zhedoubushishi merged PR #6168:
URL: https://github.com/apache/hudi/pull/6168





[GitHub] [hudi] zhedoubushishi commented on pull request #6168: [DOCS] Update aws dynamodb lock provider docs

2022-08-04 Thread GitBox


zhedoubushishi commented on PR #6168:
URL: https://github.com/apache/hudi/pull/6168#issuecomment-1205962690

   @atharvai Thanks for updating the doc! LGTM





[GitHub] [hudi] XuQianJin-Stars commented on a diff in pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full

2022-08-04 Thread GitBox


XuQianJin-Stars commented on code in PR #6284:
URL: https://github.com/apache/hudi/pull/6284#discussion_r938382783


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -92,11 +92,12 @@ protected HoodieMergedLogRecordScanner(FileSystem fs, 
String basePath, List(maxMemorySizeInBytes, 
spillableMapBasePath, new DefaultSizeEstimator(),
+  this.records = new ExternalSpillableMap<>(maxMemorySizeInBytes, basePath 
+ spillableMapBasePath, new DefaultSizeEstimator(),
   new HoodieRecordSizeEstimator(readerSchema), diskMapType, 
isBitCaskDiskMapCompressionEnabled);
+

Review Comment:
   > not sure if we can do this. spillableMapbase path is configurable. If one 
does not want "/tmp/" which is the default, they can always override using the 
configs.
   
   Requiring users to specify this option separately is more troublesome; since 
the `basepath` will almost certainly be on a large data disk, the temporary 
directory can be placed on the same disk as the `basepath`.









[GitHub] [hudi] nsivabalan commented on a diff in pull request #6157: [HUDI-4431] Fix log file will not roll over to a new file

2022-08-04 Thread GitBox


nsivabalan commented on code in PR #6157:
URL: https://github.com/apache/hudi/pull/6157#discussion_r938380933


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java:
##
@@ -94,7 +94,8 @@ private FSDataOutputStream getOutputStream() throws 
IOException, InterruptedExce
   Path path = logFile.getPath();
   if (fs.exists(path)) {
 boolean isAppendSupported = 
StorageSchemes.isAppendSupported(fs.getScheme());
-if (isAppendSupported) {
+boolean needRollOverToNewFile = fs.getFileStatus(path).getLen() > 
sizeThreshold;
+if (isAppendSupported && !needRollOverToNewFile) {

Review Comment:
   @XuQianJin-Stars : any updates on this end. 
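The condition in the diff above reduces to a small predicate. This is an illustrative Python sketch with hypothetical parameter names; the real check lives in `HoodieLogFormatWriter#getOutputStream`:

```python
def should_append_to_existing_log(append_supported, current_len, size_threshold):
    # Append only when the filesystem scheme supports append AND the
    # existing log file has not yet outgrown the roll-over threshold;
    # otherwise the writer rolls over to a new log file.
    need_rollover = current_len > size_threshold
    return append_supported and not need_rollover
```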






[hudi] branch asf-site updated: [DOCS] add description about clean policy based on hours (#6215)

2022-08-04 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 80c2f59190 [DOCS] add description about clean policy based on hours 
(#6215)
80c2f59190 is described below

commit 80c2f591908680b8cb1d7c2f815a37840af1ee15
Author: feiyang_deepnova <736320...@qq.com>
AuthorDate: Fri Aug 5 09:50:48 2022 +0800

[DOCS] add description about clean policy based on hours (#6215)

Co-authored-by: linfey 
---
 website/versioned_docs/version-0.11.1/hoodie_cleaner.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/website/versioned_docs/version-0.11.1/hoodie_cleaner.md 
b/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
index 10f1aa2450..34c1cf11d1 100644
--- a/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
+++ b/website/versioned_docs/version-0.11.1/hoodie_cleaner.md
@@ -23,6 +23,9 @@ disk for at least 5 hours, thereby preventing the longest 
running query from fai
 This policy is useful when it is known how many MAX versions of the file does 
one want to keep at any given time. 
 To achieve the same behaviour as before of preventing long running queries 
from failing, one should do their calculations 
 based on data patterns. Alternatively, this policy is also useful if a user 
just wants to maintain 1 latest version of the file.
+- **KEEP_LATEST_BY_HOURS**: This policy cleans up based on hours. It is simple and useful when you know how long files should be kept at any given time.
+  Commits with commit times older than the configured number of hours to be retained are cleaned.
+  Currently you can configure this via the parameter 'hoodie.cleaner.hours.retained'.
 
 ### Configurations
 For details about all possible configurations and their default values see the 
[configuration 
docs](https://hudi.apache.org/docs/configurations#Compaction-Configs).
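The hours-based policy documented in the diff above can be sketched as follows. This is illustrative Python, where `hours_retained` stands in for `hoodie.cleaner.hours.retained`; the real selection logic lives in Hudi's `CleanPlanner`:

```python
from datetime import datetime, timedelta

def commits_eligible_for_cleaning(commit_times, hours_retained, now):
    # KEEP_LATEST_BY_HOURS: commits whose commit time is older than the
    # retention window are eligible for cleaning.
    cutoff = now - timedelta(hours=hours_retained)
    return [t for t in commit_times if t < cutoff]
```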



[GitHub] [hudi] nsivabalan merged pull request #6215: [DOCS] Added the description of the cleaning policy

2022-08-04 Thread GitBox


nsivabalan merged PR #6215:
URL: https://github.com/apache/hudi/pull/6215





[GitHub] [hudi] nsivabalan commented on pull request #6216: [HUDI-4475] fix create table with not exists hoodie properties file

2022-08-04 Thread GitBox


nsivabalan commented on PR #6216:
URL: https://github.com/apache/hudi/pull/6216#issuecomment-1205950765

   We added support for updating hoodie.properties in a live environment, mainly to update 
table properties like metadata-related props (e.g. the list of partitions in the 
metadata table). Here is how the upgrade works so that it is fault-tolerant and 
recoverable.
   
   Starting point: orig.hoodie.properties
   
   Step 1:
   Take a backup: 
   cp orig.hoodie.properties backup.hoodie.properties
   
   Step 2: 
   Delete orig.hoodie.properties.
   
   Step 3:
   Create the new hoodie.properties in memory with any new properties required, 
   then create orig.hoodie.properties.
   
   Step 4: 
   Delete backup.hoodie.properties.
   
   Between step 2 and step 3, readers will read backup.hoodie.properties.
   
   This is designed such that, if there is a crash at any point, we are safe 
   and restarting the pipeline suffices. 
   ref: 
https://github.com/apache/hudi/blob/a75cc02273ae87c383ae1ed46f95006c366f70fc/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java#L344
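   The steps above can be sketched with plain file operations. This is a minimal Python sketch assuming local-filesystem semantics and an assumed backup file name; the actual implementation in `HoodieTableConfig` works against a Hadoop `FileSystem`:

```python
import os
import shutil

PROPS = "hoodie.properties"
BACKUP = "hoodie.properties.backup"  # assumed backup name, for illustration

def update_table_config(table_dir, new_contents):
    props = os.path.join(table_dir, PROPS)
    backup = os.path.join(table_dir, BACKUP)
    shutil.copyfile(props, backup)   # Step 1: take a backup
    os.remove(props)                 # Step 2: delete the original
    with open(props, "w") as f:      # Step 3: write the new properties
        f.write(new_contents)
    os.remove(backup)                # Step 4: drop the backup

def read_table_config(table_dir):
    # Readers fall back to the backup when hoodie.properties is missing,
    # i.e. if a crash happened between step 2 and step 3.
    props = os.path.join(table_dir, PROPS)
    path = props if os.path.exists(props) else os.path.join(table_dir, BACKUP)
    with open(path) as f:
        return f.read()
```

   A crash at any point leaves either the original, the backup, or the new file readable, so restarting the pipeline suffices.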
   
   




