[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning

2022-09-06 Thread GitBox


dongkelun commented on code in PR #5478:
URL: https://github.com/apache/hudi/pull/5478#discussion_r964449710


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -502,14 +502,20 @@ public void handle(@NotNull Context context) throws Exception {
 if (refreshCheck) {
   long beginFinalCheck = System.currentTimeMillis();
   if (isLocalViewBehind(context)) {
-String errMsg =
-"Last known instant from client was "
-+ context.queryParam(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS,
-HoodieTimeline.INVALID_INSTANT_TS)
-+ " but server has the following timeline "
-+ viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM))
-.getTimeline().getInstants().collect(Collectors.toList());
-throw new BadRequestResponse(errMsg);
+String lastInstantTs = context.queryParam(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS,
+HoodieTimeline.INVALID_INSTANT_TS);
+HoodieTimeline localTimeline =
+viewManager.getFileSystemView(context.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM)).getTimeline();
+HoodieTimeline afterLastInstantTimeLine = localTimeline.findInstantsAfter(lastInstantTs).filterCompletedInstants();
+if (!(afterLastInstantTimeLine.countInstants() == 1

Review Comment:
   Done
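For readers following along, the intent of the hunk above can be sketched outside Hudi's API: after async cleaning, the server timeline may legitimately contain a completed instant the client has not seen, so a strict "any newer instant means the view is behind" check has to be relaxed. The helper below is a hypothetical, simplified stand-in (the names `TimelineInstant` and `isViewBehind` and the action strings are illustrative, not Hudi's actual types; the exact tolerated condition in the PR is truncated above, so a lone clean/rollback is assumed here):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a timeline instant: (timestamp, action).
final class TimelineInstant {
    final String ts;
    final String action; // e.g. "commit", "clean", "rollback"
    TimelineInstant(String ts, String action) { this.ts = ts; this.action = action; }
}

public class TimelineCheck {
    // Returns true when the local view should be reported as behind the
    // client. A single completed clean/rollback newer than the client's
    // last known instant is tolerated, since async cleaning can add one
    // without the client's view actually being stale.
    static boolean isViewBehind(List<TimelineInstant> completed, String lastKnownTs) {
        List<TimelineInstant> after = completed.stream()
            .filter(i -> i.ts.compareTo(lastKnownTs) > 0)
            .collect(Collectors.toList());
        if (after.isEmpty()) {
            return false; // client is fully caught up
        }
        if (after.size() == 1) {
            String action = after.get(0).action;
            return !(action.equals("clean") || action.equals("rollback"));
        }
        return true;
    }

    public static void main(String[] args) {
        List<TimelineInstant> timeline = Arrays.asList(
            new TimelineInstant("001", "commit"),
            new TimelineInstant("002", "clean"));
        System.out.println(isViewBehind(timeline, "001")); // prints false
        System.out.println(isViewBehind(timeline, "000")); // prints true
    }
}
```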



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java:
##
@@ -64,11 +65,16 @@ private int getCommitsSinceLastCleaning() {
 Option lastCleanInstant = table.getActiveTimeline().getCleanerTimeline().filterCompletedInstants().lastInstant();
 HoodieTimeline commitTimeline = table.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();

-String latestCleanTs;
-int numCommits = 0;
-if (lastCleanInstant.isPresent()) {
-  latestCleanTs = lastCleanInstant.get().getTimestamp();
-  numCommits = commitTimeline.findInstantsAfter(latestCleanTs).countInstants();
+int numCommits;
+if (lastCleanInstant.isPresent() && !table.getActiveTimeline().isEmpty(lastCleanInstant.get())) {
+  try {
+HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils
+.deserializeHoodieCleanMetadata(table.getActiveTimeline().getInstantDetails(lastCleanInstant.get()).get());
+String lastCompletedCommitTimestamp = cleanMetadata.getLastCompletedCommitTimestamp();
+numCommits = commitTimeline.findInstantsAfter(lastCompletedCommitTimestamp).countInstants();
+  } catch (IOException e) {
+throw new HoodieIOException(e.getMessage(), e);

Review Comment:
   Good idea, Done
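The fix above counts commits since the last clean using the clean metadata's `lastCompletedCommitTimestamp` rather than the clean instant's own timestamp. A minimal sketch of that counting logic, with timelines reduced to sorted timestamp strings (hypothetical names, not the actual Hudi classes):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model: timelines reduced to sorted timestamp strings.
public class CommitsSinceClean {
    // Count completed commits strictly after the last clean's recorded
    // lastCompletedCommitTimestamp (in the PR this is deserialized from
    // the clean instant's HoodieCleanMetadata; here it is passed in).
    static int commitsSinceLastClean(List<String> commitTs,
                                     Optional<String> lastCompletedCommitTs) {
        if (!lastCompletedCommitTs.isPresent()) {
            return commitTs.size(); // never cleaned: every commit counts
        }
        String floor = lastCompletedCommitTs.get();
        return (int) commitTs.stream()
            .filter(ts -> ts.compareTo(floor) > 0)
            .count();
    }

    public static void main(String[] args) {
        List<String> commits = List.of("001", "002", "003");
        System.out.println(commitsSinceLastClean(commits, Optional.of("001"))); // prints 2
        System.out.println(commitsSinceLastClean(commits, Optional.empty()));   // prints 3
    }
}
```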



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] dongkelun commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning

2022-09-06 Thread GitBox


dongkelun commented on PR #5478:
URL: https://github.com/apache/hudi/pull/5478#issuecomment-1238980827

   > A high level thought. I appreciate you parametrized most tests for both 
sync and async cleaning. But do keep in mind that, these are going to add to 
our test run-time. So, may be take another look to see if we need to make all 
cleaner tests run for both sync and async. Enable for those where it matters 
and for others probably running it for sync cleaning should suffice.
   
   Unnecessary asynchronous cleaning test cases have been removed





[GitHub] [hudi] dujl commented on pull request #6525: [HUDI-4237] should not sync partition parameters when create non-partition table in spark

2022-09-06 Thread GitBox


dujl commented on PR #6525:
URL: https://github.com/apache/hudi/pull/6525#issuecomment-1238958055

   @alexeykudinkin please help to review and approve





[GitHub] [hudi] dujl commented on pull request #6525: [HUDI-4237] should not sync partition parameters when create non-partition table in spark

2022-09-06 Thread GitBox


dujl commented on PR #6525:
URL: https://github.com/apache/hudi/pull/6525#issuecomment-1238956445

   > @dujl can you please update the PR description w/ the crux of the issue?
   > 
   > The one in Jira is very detailed (thanks for providing it!), but it's 
important to make sure PRs also have detailed descriptions as well.
   
   done





[GitHub] [hudi] dujl commented on pull request #6525: [HUDI-4237] should not sync partition parameters when create non-partition table in spark

2022-09-06 Thread GitBox


dujl commented on PR #6525:
URL: https://github.com/apache/hudi/pull/6525#issuecomment-1238956173

   > 
   
   done





[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-09-06 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1238943910

   
   ## CI report:
   
   * 37785220f2d17a1a04d136521f10c3a0314fe448 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10970)
 
   * ecc4f73ee21eac826979c427414a8560036ceceb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11213)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] ThinkerLei commented on a diff in pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-09-06 Thread GitBox


ThinkerLei commented on code in PR #6384:
URL: https://github.com/apache/hudi/pull/6384#discussion_r964412660


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java:
##
@@ -98,7 +111,10 @@ public String getFileExtension() {
   }
 
   public Path getPath() {
-return new Path(pathStr);
+if (path == null) {
+  this.path = new Path(pathStr);

Review Comment:
   @yihua Thanks for your comments and suggestions. I have taken your suggestion and submitted again. I haven't had time to compare the memory usage before and after the patch, but with four new variables we can roughly estimate the memory usage.
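The change under review caches the parsed `Path` so repeated `getPath()` calls reuse one object instead of re-parsing the string each time. A generic sketch of that lazy-initialization pattern (`ParsedPath` is a hypothetical stand-in for `org.apache.hadoop.fs.Path`; single-threaded use is assumed):

```java
// Hypothetical sketch of the lazy caching applied to getPath();
// ParsedPath stands in for org.apache.hadoop.fs.Path.
public class LazyPath {
    private final String pathStr;
    private ParsedPath path; // parsed on first access, then reused

    LazyPath(String pathStr) { this.pathStr = pathStr; }

    // Parse once and cache. Not synchronized: a racing caller may parse
    // twice, which is benign because ParsedPath is immutable and parsing
    // is deterministic.
    public ParsedPath getPath() {
        if (path == null) {
            path = new ParsedPath(pathStr);
        }
        return path;
    }

    // Immutable stand-in for the real Path type.
    static final class ParsedPath {
        final String raw;
        ParsedPath(String raw) { this.raw = raw; }
    }

    public static void main(String[] args) {
        LazyPath f = new LazyPath("/tmp/table/file.log");
        System.out.println(f.getPath() == f.getPath()); // prints true
    }
}
```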






[GitHub] [hudi] hudi-bot commented on pull request #6619: [HUDI-4796] MetricsReporter stop bug

2022-09-06 Thread GitBox


hudi-bot commented on PR #6619:
URL: https://github.com/apache/hudi/pull/6619#issuecomment-1238940922

   
   ## CI report:
   
   * 461a755d6938132f17243987fb7ab5e69a883f1e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11212)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-09-06 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1238940534

   
   ## CI report:
   
   * 37785220f2d17a1a04d136521f10c3a0314fe448 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10970)
 
   * ecc4f73ee21eac826979c427414a8560036ceceb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6619: [HUDI-4796] MetricsReporter stop bug

2022-09-06 Thread GitBox


hudi-bot commented on PR #6619:
URL: https://github.com/apache/hudi/pull/6619#issuecomment-1238937620

   
   ## CI report:
   
   * 461a755d6938132f17243987fb7ab5e69a883f1e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6616: Add Postgres Schema Name to Postgres Debezium Source

2022-09-06 Thread GitBox


hudi-bot commented on PR #6616:
URL: https://github.com/apache/hudi/pull/6616#issuecomment-1238934271

   
   ## CI report:
   
   * 8176e809b4f329e0cfbff75484b3595c69970207 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11202)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

2022-09-06 Thread GitBox


hudi-bot commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238934239

   
   ## CI report:
   
   * a990d7b411e5692568e548f4b31394f1fd051e77 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11201)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573

2022-09-06 Thread GitBox


hudi-bot commented on PR #6574:
URL: https://github.com/apache/hudi/pull/6574#issuecomment-1238934128

   
   ## CI report:
   
   * 277061fa910ff388b9fa580083fd3af406ce3b94 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11160)
 
   * 86efca51356b9d673dc971b23e8280872c62ba41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11206)
 
   * 7ced8cc1e89594e2a074a546a165ce3ef744841f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11211)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4796) Properly release MetricsReporter resources

2022-09-06 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown updated HUDI-4796:

Status: In Progress  (was: Open)

> Properly release MetricsReporter resources
> --
>
> Key: HUDI-4796
> URL: https://issues.apache.org/jira/browse/HUDI-4796
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> In 
> [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65]
>  we are calling the close method on a class instead of the Reporter's `stop` 
> method. The `stop` method according to the Java docs "Should be used to stop 
> channels, streams and release resources." 
> For most reporters these two actions are equivalent but the 
> [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127]
>  has a more involved stop method that must be called. 
>  
> Relates to discussion 
> [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970]
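The issue can be illustrated with a toy reporter: when `stop()` does more than `close()` (unregistering MBeans, stopping a background thread), calling only `close()` leaks those resources. This is a hypothetical sketch, not Hudi's actual `JmxReporterServer`:

```java
import java.util.ArrayList;
import java.util.List;

// Toy reporter showing why stop() must be called: close() alone would
// skip the unregister/thread-shutdown steps that stop() performs.
public class ReporterLifecycle {
    static final List<String> events = new ArrayList<>();

    static class JmxLikeReporter {
        void close() { events.add("closed"); }

        // Involved shutdown: unregister MBeans and stop the background
        // thread before closing, mirroring a JMX-style reporter.
        void stop() {
            events.add("unregistered");
            events.add("thread-stopped");
            close();
        }
    }

    public static void main(String[] args) {
        new JmxLikeReporter().stop();
        System.out.println(events); // prints [unregistered, thread-stopped, closed]
    }
}
```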



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4796) Properly release MetricsReporter resources

2022-09-06 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown updated HUDI-4796:

Status: Patch Available  (was: In Progress)

> Properly release MetricsReporter resources
> --
>
> Key: HUDI-4796
> URL: https://issues.apache.org/jira/browse/HUDI-4796
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> In 
> [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65]
>  we are calling the close method on a class instead of the Reporter's `stop` 
> method. The `stop` method according to the Java docs "Should be used to stop 
> channels, streams and release resources." 
> For most reporters these two actions are equivalent but the 
> [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127]
>  has a more involved stop method that must be called. 
>  
> Relates to discussion 
> [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970]





[GitHub] [hudi] xushiyan commented on a diff in pull request #4718: [HUDI-3345][RFC-36] Hudi metastore server

2022-09-06 Thread GitBox


xushiyan commented on code in PR #4718:
URL: https://github.com/apache/hudi/pull/4718#discussion_r964383930


##
rfc/rfc-36/rfc-36.md:
##
@@ -0,0 +1,605 @@
+
+# RFC-36: Hudi Metastore Server
+
+## Proposers
+
+- @minihippo
+
+## Approvers
+
+
+## Status
+
+JIRA: [HUDI-3345](https://issues.apache.org/jira/browse/HUDI-3345)
+
+> Please keep the status updated in `rfc/README.md`.
+
+# Hudi Metastore Server
+
+## Abstract
+
+Currently, Hudi is widely used as a table format in the data warehouse, but 
there is no central metastore server to manage the metadata of data lake 
tables. Hive metastore, a commonly used catalog service in the data warehouse 
on Hadoop, cannot store metadata unique to hudi, such as the table's timeline.
+
+The proposal is to implement a unified metadata management system, the hudi 
metastore server, to store the metadata of hudi tables, and to be compatible 
with hive metastore so that other engines can access it without any changes.
+
+## Background
+
+**How Hudi metadata is stored**
+
+The metadata of hudi consists of the table location, configuration and schema, 
the timeline generated by instants, and the metadata of each commit / instant, 
which records the files created / updated and the number of new records in that 
commit. Besides, the information about files in a hudi table is also a part of 
hudi metadata.
+
+Different from instants or the schema, which are recorded in separate files 
stored under `${tablelocation}/.hoodie` on HDFS or object storage, file info is 
managed by HDFS directly. Hudi gets all files of a table by file listing, a 
costly operation whose performance is limited by the namenode. In addition, 
there may be a few invalid files on the file system, created for example by 
spark speculative tasks and not deleted successfully. Since getting files by 
listing can therefore be inconsistent, hudi has to store the valid files from 
each commit's metadata; this metadata about files is usually referred to as the 
snapshot.
+
+RFC-15 metadata table is a proposal that can solve these problems. However, it 
only manages the metadata of one table. There is a lack of a unified view.
+
+**The integration of Hive metastore and Hudi metadata lacks a single source of 
truth.**
+
+Hive metastore server is widely used as a metadata center in the data 
warehouse on Hadoop. It stores the metadata for hive tables like their schema, 
location and partitions. Currently, almost all of the storage or computing 
engines support registering table information to it, discovering and retrieving 
metadata from it. Meanwhile, cloud service providers like AWS Glue, HUAWEI 
Cloud, Google Cloud Dataproc, Alibaba Cloud, ByteDance Volcano Engine all 
provide Apache Hive metastore compatible catalog. It seems that hive metastore 
has become a standard in the data warehouse.
+
+Different from a traditional table format like the hive table, a data lake 
table not only has schema, partitions and other hive metadata, but also a 
timeline and snapshots, which are unconventional. Hence, data lake metadata 
cannot be managed by HMS directly.
+
+For now, Hudi only syncs the schema and partitions to HMS; other metadata is 
still stored on HDFS or object storage. Metadata synchronization between 
different metadata management systems will result in inconsistency.
+
+## Overview
+
+![architecture](architecture.png)
+
+The hudi metastore server manages the metadata of data lake tables, supporting 
metadata persistence, efficient metadata access and other data lake extensions. 
The metadata it manages includes the information of databases and tables, 
partitions, schemas, instants, instants' meta and files' meta.
+
+The metastore server has two main components: service and storage. The storage 
is for metadata persistence, and the service receives get / put requests from 
clients and returns / stores the result after performing some logical 
operations on the metadata.
+
+The hudi metastore server is / has
+
+- **A metastore server for data lake**
+-  Different from the traditional table format, the metadata of the data 
lake has timeline and snapshot concepts, in addition to schema and partitions.
+
+-  The metastore server is a unified metadata management system for data 
lake tables.
+
+- **Pluggable storage**
+-  The storage is only responsible for metadata persistence. Therefore, it 
doesn't matter which storage engine is used to store the data; it can be an 
RDBMS, a KV system or a file system.
+
+- **Easy to be expanded**
+-  The service is stateless, so it can be scaled horizontally to support 
higher QPS. The storage can be split vertically to store more data.
+
+- **Compatible with multiple computing engines**
+-  The server has an adapter to be compatible with hive metastore server.
+
+## Design
+
+This part has four sections: what the service does, what metadata is stored 
and how, how the service interacts with the storage when reading and writing

[jira] [Updated] (HUDI-4796) Properly release MetricsReporter resources

2022-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4796:
-
Labels: pull-request-available  (was: )

> Properly release MetricsReporter resources
> --
>
> Key: HUDI-4796
> URL: https://issues.apache.org/jira/browse/HUDI-4796
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> In 
> [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65]
>  we are calling the close method on a class instead of the Reporter's `stop` 
> method. The `stop` method according to the Java docs "Should be used to stop 
> channels, streams and release resources." 
> For most reporters these two actions are equivalent but the 
> [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127]
>  has a more involved stop method that must be called. 
>  
> Relates to discussion 
> [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970]





[GitHub] [hudi] the-other-tim-brown opened a new pull request, #6619: [HUDI-4796] MetricsReporter stop bug

2022-09-06 Thread GitBox


the-other-tim-brown opened a new pull request, #6619:
URL: https://github.com/apache/hudi/pull/6619

   ### Change Logs
   
   - Removes a confusing method, `getReporter()` in the abstract class 
MetricsReporter since we want to be calling `stop` the MetricsReporter 
instances to make sure they are cleaned up properly
   - Updates Metrics.java to call `stop` method
   - Updates implementations of MetricsReporter to no longer implement 
getReporter method
   - Fixes import order in classes that were touched as part of this PR
   
   ### Impact
   
   The API for the `MetricsReporter` is changing due to the removal of the 
`getReporter` method
   
   **Risk level: low**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Assigned] (HUDI-4796) Properly release MetricsReporter resources

2022-09-06 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown reassigned HUDI-4796:
---

Assignee: Timothy Brown

> Properly release MetricsReporter resources
> --
>
> Key: HUDI-4796
> URL: https://issues.apache.org/jira/browse/HUDI-4796
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>
> In 
> [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65]
>  we are calling the close method on a class instead of the Reporter's `stop` 
> method. The `stop` method according to the Java docs "Should be used to stop 
> channels, streams and release resources." 
> For most reporters these two actions are equivalent but the 
> [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127]
>  has a more involved stop method that must be called. 
>  
> Relates to discussion 
> [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970]





[jira] [Created] (HUDI-4796) Properly release MetricsReporter resources

2022-09-06 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-4796:
---

 Summary: Properly release MetricsReporter resources
 Key: HUDI-4796
 URL: https://issues.apache.org/jira/browse/HUDI-4796
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Timothy Brown


In 
[Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65]
 we are calling the close method on a class instead of the Reporter's `stop` 
method. The `stop` method according to the Java docs "Should be used to stop 
channels, streams and release resources." 

For most reporters these two actions are equivalent but the 
[JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127]
 has a more involved stop method that must be called. 

 

Relates to discussion 
[here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970]





[GitHub] [hudi] xushiyan commented on pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`

2022-09-06 Thread GitBox


xushiyan commented on PR #5064:
URL: https://github.com/apache/hudi/pull/5064#issuecomment-1238907156

   @prasannarajaperumal Other than the RPC protocol consideration @minihippo 
mentioned, with Thrift-generated models we'll have flexibility in adapting to 
different metastores / catalogs like AWS Glue, DataHub, etc. I discussed with 
@minihippo separately about having something like a 
hudi-metastore-proxy-bundle.jar to sync to those catalogs. This hasn't been 
added to the current RFC doc. @minihippo is working on publishing an updated 
RFC including all the planned future capabilities.





[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573

2022-09-06 Thread GitBox


hudi-bot commented on PR #6574:
URL: https://github.com/apache/hudi/pull/6574#issuecomment-1238906690

   
   ## CI report:
   
   * 277061fa910ff388b9fa580083fd3af406ce3b94 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11160)
 
   * 86efca51356b9d673dc971b23e8280872c62ba41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11206)
 
   * 7ced8cc1e89594e2a074a546a165ce3ef744841f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238904393

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   * d474c352d06287240f1b26c4bb31931eb216717c UNKNOWN
   * 1768dc61360813b1240679c5f89d037c70135855 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11207)
 
   * 3d8adcef64636820081e56b4876cf23a7efcbe53 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11210)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238902007

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   * d474c352d06287240f1b26c4bb31931eb216717c UNKNOWN
   * 1768dc61360813b1240679c5f89d037c70135855 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11207)
 
   * 3d8adcef64636820081e56b4876cf23a7efcbe53 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6615: [HUDI-4758] Add validations to java spark examples

2022-09-06 Thread GitBox


hudi-bot commented on PR #6615:
URL: https://github.com/apache/hudi/pull/6615#issuecomment-1238899373

   
   ## CI report:
   
   * 61214015c3aed029c00882f121e6ec0333767e7f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11199)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`

2022-09-06 Thread GitBox


xushiyan commented on code in PR #5064:
URL: https://github.com/apache/hudi/pull/5064#discussion_r964373714


##
hudi-metaserver/src/main/java/org/apache/hudi/common/table/HoodieTableMetaServerClient.java:
##
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.UserGroupInformation;
+import org.apache.hudi.common.config.HoodieMetaServerConfig;
+import org.apache.hudi.common.fs.ConsistencyGuardConfig;
+import org.apache.hudi.common.fs.FileSystemRetryConfig;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieMetaServerBasedTimeline;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metaserver.client.HoodieMetaServerClient;
+import org.apache.hudi.metaserver.client.HoodieMetaServerClientProxy;
+import org.apache.hudi.metaserver.thrift.NoSuchObjectException;
+import org.apache.hudi.metaserver.thrift.Table;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * HoodieTableMetaClient implementation for hoodie table whose metadata is 
stored in the hoodie meta server.
+ */
+public class HoodieTableMetaServerClient extends HoodieTableMetaClient {

Review Comment:
   While I agree with the idea of the abstraction, I don't see it as a must-have 
for the initial version. The new interface won't have much impact on the current 
implementation, and the implementation does not affect other modules' code either. 
I think a follow-up refactoring is acceptable; or, better, wait for the models in 
the metaserver to stabilize and then refactor with a more holistic view of those 
models.






[GitHub] [hudi] hudi-bot commented on pull request #6016: [HUDI-4465] Optimizing file-listing sequence of Metadata Table

2022-09-06 Thread GitBox


hudi-bot commented on PR #6016:
URL: https://github.com/apache/hudi/pull/6016#issuecomment-1238898719

   
   ## CI report:
   
   * 46e53b5182ffdf6fa43b5a93921222e869e4e535 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11200)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Closed] (HUDI-4635) Update roadmap page based on H2 2022 plan

2022-09-06 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-4635.
---
Resolution: Fixed

> Update roadmap page based on H2 2022 plan
> -
>
> Key: HUDI-4635
> URL: https://issues.apache.org/jira/browse/HUDI-4635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)

2022-09-06 Thread GitBox


nsivabalan commented on issue #6609:
URL: https://github.com/apache/hudi/issues/6609#issuecomment-1238884639

   We identified an issue with Spark streaming where duplicate data could sneak 
into Hudi on failures: 
   https://github.com/apache/hudi/pull/6098
   Can you give it a try with the latest master, or apply this patch and try it 
out?
   





[GitHub] [hudi] danny0405 commented on pull request #6429: [HUDI-4636] Output preCombine&partition fields of delete records when changelog disabled

2022-09-06 Thread GitBox


danny0405 commented on PR #6429:
URL: https://github.com/apache/hudi/pull/6429#issuecomment-1238884544

   > We need the preCombine and partition fields also, so pull this request.
   
   Can you explain why we need this, then? Do you want to write these records to 
another Hudi table?





[GitHub] [hudi] boneanxs commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-06 Thread GitBox


boneanxs commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1238883968

   Hey @alexeykudinkin, I've addressed all the comments, could you please review 
again?





[GitHub] [hudi] danny0405 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…

2022-09-06 Thread GitBox


danny0405 commented on PR #6595:
URL: https://github.com/apache/hudi/pull/6595#issuecomment-1238882822

   > When spark use loadPartitionBucketIdFileIdMapping of 
org.apache.hudi.index.bucket.HoodieSimpleBucketIndex, it will not find the 
bucket num which written by hudi-flink
   
   Then it seems we should fix the code on the Spark side, right?





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-06 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238882651

   Oh, I thought both jobs were running concurrently? Is that not the case? Can 
you shed some light on the exact steps? Is it:
   Step 1: start job1 in EMR cluster1, which consumes from source X and writes 
to Hudi table Y.
   Step 2: stop job1; it's essentially a batch job.
   Step 3: start job2 in EMR cluster2, which again consumes from source X and 
writes to Hudi table Y.
   Now, if you query Hudi, do you see duplicate data?
   
   Is my understanding right?
   
   Also, can you share the write configs you used?
   
   





[GitHub] [hudi] nsivabalan commented on issue #6606: Observing data duplication with Single Writer

2022-09-06 Thread GitBox


nsivabalan commented on issue #6606:
URL: https://github.com/apache/hudi/issues/6606#issuecomment-1238880984

   Unless you configure lock providers, Hudi can't guarantee this. I would 
suggest adding locking for both writers.
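   For reference, multi-writer locking in Hudi is enabled through write configs. 
A minimal sketch of the relevant properties, assuming the ZooKeeper-based lock 
provider (the host, port, and paths below are placeholders, not values from this 
thread):

   ```java
   import java.util.Properties;

   public class MultiWriterLockConfig {
       // Builds the write configs that enable optimistic concurrency control
       // with a ZooKeeper-based lock provider for two concurrent writers.
       public static Properties lockConfigs() {
           Properties props = new Properties();
           // Coordinate concurrent writers via optimistic concurrency control.
           props.setProperty("hoodie.write.concurrency.mode",
               "optimistic_concurrency_control");
           // Lazily clean failed writes so a crashed writer is reconciled safely.
           props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
           // Lock provider class shipped with Hudi.
           props.setProperty("hoodie.write.lock.provider",
               "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
           // Placeholder ZooKeeper connection details.
           props.setProperty("hoodie.write.lock.zookeeper.url", "zk-host");
           props.setProperty("hoodie.write.lock.zookeeper.port", "2181");
           props.setProperty("hoodie.write.lock.zookeeper.lock_key", "my_table");
           props.setProperty("hoodie.write.lock.zookeeper.base_path", "/hudi_locks");
           return props;
       }

       public static void main(String[] args) {
           // Both writers would pass these same properties in their write options.
           Properties props = lockConfigs();
           System.out.println(props.getProperty("hoodie.write.concurrency.mode"));
       }
   }
   ```

   Both jobs would need the same lock configs pointing at the same ZooKeeper 
path so that their commits to table Y are serialized.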





[GitHub] [hudi] voonhous commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


voonhous commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964360302


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -62,6 +64,8 @@ public class HoodieFlinkClusteringJob {
 
   protected static final Logger LOG = 
LoggerFactory.getLogger(HoodieFlinkClusteringJob.class);
 
+  private static final String NO_EXECUTE_CALL_KEYWORD = "The application 
contains no execute() calls";
+

Review Comment:
   Done






[GitHub] [hudi] danny0405 commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


danny0405 commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964359395


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -62,6 +64,8 @@ public class HoodieFlinkClusteringJob {
 
   protected static final Logger LOG = 
LoggerFactory.getLogger(HoodieFlinkClusteringJob.class);
 
+  private static final String NO_EXECUTE_CALL_KEYWORD = "The application 
contains no execute() calls";
+

Review Comment:
   We can simplify the keyword to `no execute` instead.






[GitHub] [hudi] nsivabalan commented on issue #6590: [SUPPORT] HoodieDeltaStreamer AWSDmsAvroPayload fails to handle deletes in MySQL

2022-09-06 Thread GitBox


nsivabalan commented on issue #6590:
URL: https://github.com/apache/hudi/issues/6590#issuecomment-1238879167

   @codope : this is similar to the other issue you were triaging last week.





[hudi] branch master updated: [HUDI-4615] Return checkpoint as null for empty data from events queue. (#6387)

2022-09-06 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d2d1cb8b28 [HUDI-4615] Return checkpoint as null for empty data from 
events queue.  (#6387)
d2d1cb8b28 is described below

commit d2d1cb8b289f43e4f467d197013f1273cd350034
Author: Vinish Reddy 
AuthorDate: Wed Sep 7 09:24:38 2022 +0530

[HUDI-4615] Return checkpoint as null for empty data from events queue.  
(#6387)


Co-authored-by: sivabalan 
---
 .../sources/helpers/S3EventsMetaSelector.java  |  4 +++-
 .../sources/helpers/TestS3EventsMetaSelector.java  | 24 ++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java
index 68ac7aba5c..13de715540 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java
@@ -153,7 +153,9 @@ public class S3EventsMetaSelector extends 
CloudObjectsSelector {
   for (Map eventRecord : eventRecords) {
 filteredEventRecords.add(new 
ObjectMapper().writeValueAsString(eventRecord).replace("%3D", "="));
   }
-  return new ImmutablePair<>(filteredEventRecords, 
String.valueOf(newCheckpointTime));
+  // Return the old checkpoint if no messages to consume from queue.
+  String newCheckpoint = newCheckpointTime == 0 ? 
lastCheckpointStr.orElse(null) : String.valueOf(newCheckpointTime);
+  return new ImmutablePair<>(filteredEventRecords, newCheckpoint);
 } catch (JSONException | IOException e) {
   throw new HoodieException("Unable to read from SQS: ", e);
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestS3EventsMetaSelector.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestS3EventsMetaSelector.java
index 2208543c08..f38e89b217 100644
--- 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestS3EventsMetaSelector.java
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/helpers/TestS3EventsMetaSelector.java
@@ -27,11 +27,14 @@ import org.apache.hudi.testutils.HoodieClientTestHarness;
 import org.apache.hudi.utilities.testutils.CloudObjectTestUtils;
 
 import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
 import com.amazonaws.services.sqs.model.Message;
 import org.apache.hadoop.fs.Path;
 import org.json.JSONObject;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
 import org.junit.jupiter.params.ParameterizedTest;
 import org.junit.jupiter.params.provider.ValueSource;
 import org.mockito.Mock;
@@ -43,8 +46,12 @@ import java.util.List;
 
 import static 
org.apache.hudi.utilities.sources.helpers.CloudObjectsSelector.Config.S3_SOURCE_QUEUE_REGION;
 import static 
org.apache.hudi.utilities.sources.helpers.CloudObjectsSelector.Config.S3_SOURCE_QUEUE_URL;
+import static 
org.apache.hudi.utilities.sources.helpers.CloudObjectsSelector.SQS_ATTR_APPROX_MESSAGES;
 import static 
org.apache.hudi.utilities.sources.helpers.TestCloudObjectsSelector.REGION_NAME;
 import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.when;
 
 public class TestS3EventsMetaSelector extends HoodieClientTestHarness {
 
@@ -102,4 +109,21 @@ public class TestS3EventsMetaSelector extends 
HoodieClientTestHarness {
 .getString("key"));
 assertEquals("1627376736755", eventFromQueue.getRight());
   }
+
+  @Test
+  public void testEventsFromQueueNoMessages() {
+S3EventsMetaSelector selector = new S3EventsMetaSelector(props);
+when(sqs.getQueueAttributes(any(GetQueueAttributesRequest.class)))
+.thenReturn(
+new GetQueueAttributesResult()
+.addAttributesEntry(SQS_ATTR_APPROX_MESSAGES, "0"));
+
+List processed = new ArrayList<>();
+Pair, String> eventFromQueue =
+selector.getNextEventsFromQueue(sqs, Option.empty(), processed);
+
+assertEquals(0, eventFromQueue.getLeft().size());
+assertEquals(0, processed.size());
+assertNull(eventFromQueue.getRight());
+  }
 }



[GitHub] [hudi] nsivabalan merged pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.

2022-09-06 Thread GitBox


nsivabalan merged PR #6387:
URL: https://github.com/apache/hudi/pull/6387





[GitHub] [hudi] ymZhao1001 commented on pull request #6366: [HUDI-4794] add an option of the log file block size

2022-09-06 Thread GitBox


ymZhao1001 commented on PR #6366:
URL: https://github.com/apache/hudi/pull/6366#issuecomment-1238877502

   > @ymZhao1001 Could you follow the process 
[here](https://hudi.apache.org/contribute/developer-setup#filing-jiras) by 
filing and claiming a Jira ticket?
   
   Done: 
   https://issues.apache.org/jira/projects/HUDI/issues/HUDI-4794





[GitHub] [hudi] voonhous commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


voonhous commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964355713


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -335,5 +391,17 @@ public void shutdownAsyncService(boolean error) {
 public void shutDown() {
   shutdownAsyncService(false);
 }
+
+/**
+ * Execute a dummy pipeline to prevent "no execute() calls" exceptions 
from being thrown if
+ * clustering is not performed.
+ */

Review Comment:
   Updated the code for both `compactor` and `clusteringjob` to include a catch 
block like this:
   
   ```java
   try {
 compactionScheduleService.compact();
   } catch (ApplicationExecutionException aee) {
 if (aee.getMessage().contains(NO_EXECUTE_CALL_KEYWORD)) {
   LOG.info("Compaction is not performed");
 } else {
   throw aee;
 }
   } catch (Exception e) {
 LOG.error("Got error running delta sync once. Shutting down", e);
 throw e;
   } finally {
 LOG.info("Shut down hoodie flink compactor");
   }
   ```
   
   Commit encompassing this change: e1f6bec7246b564266af0959ff448cb0aa2ef972







[GitHub] [hudi] danny0405 commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


danny0405 commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964350288


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -335,5 +391,17 @@ public void shutdownAsyncService(boolean error) {
 public void shutDown() {
   shutdownAsyncService(false);
 }
+
+/**
+ * Execute a dummy pipeline to prevent "no execute() calls" exceptions 
from being thrown if
+ * clustering is not performed.
+ */

Review Comment:
   > can we add an additional catch block to catch ApplicationExecutionException
   
   We can, if we can analyze the `no execute` error message.






[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238868528

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   * d474c352d06287240f1b26c4bb31931eb216717c UNKNOWN
   * 1768dc61360813b1240679c5f89d037c70135855 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11207)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6366: [HUDI-4794] add an option of the log file block size

2022-09-06 Thread GitBox


hudi-bot commented on PR #6366:
URL: https://github.com/apache/hudi/pull/6366#issuecomment-1238868304

   
   ## CI report:
   
   * 41f40900c1a22a49dce612f2684de711c6760199 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10735)
 
   * 9e8e5113a5dd1419282a3b0aa17b796b74b7f886 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11208)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] voonhous commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


voonhous commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964348442


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -335,5 +391,17 @@ public void shutdownAsyncService(boolean error) {
 public void shutDown() {
   shutdownAsyncService(false);
 }
+
+/**
+ * Execute a dummy pipeline to prevent "no execute() calls" exceptions 
from being thrown if
+ * clustering is not performed.
+ */

Review Comment:
   While the exception thrown does not affect anything (for the most part), I 
would argue that it is more confusing for the job to throw an exception when 
there are no compaction/clustering plans to execute.
   
   If downstream applications use the exit status to determine the status of the 
job, they would deem the job to have failed, given that an exception is thrown 
(unless we explicitly scan the log for keywords indicating that the failure is 
intended).
   
   My goal here is to remove such unintended errors to avoid operational 
confusion. In that case, instead of adding a dummy execution logic, can we add 
an additional catch block to catch `ApplicationExecutionException`?






[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238865999

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   * d474c352d06287240f1b26c4bb31931eb216717c UNKNOWN
   * 1768dc61360813b1240679c5f89d037c70135855 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.

2022-09-06 Thread GitBox


hudi-bot commented on PR #6387:
URL: https://github.com/apache/hudi/pull/6387#issuecomment-1238865806

   
   ## CI report:
   
   * eaf0accd1d182170e591417c2ca1ef832fae5924 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11198)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6366: [HUDI-4794] add an option of the log file block size

2022-09-06 Thread GitBox


hudi-bot commented on PR #6366:
URL: https://github.com/apache/hudi/pull/6366#issuecomment-1238865778

   
   ## CI report:
   
   * 41f40900c1a22a49dce612f2684de711c6760199 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10735)
 
   * 9e8e5113a5dd1419282a3b0aa17b796b74b7f886 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] wzx140 commented on pull request #6486: [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type

2022-09-06 Thread GitBox


wzx140 commented on PR #6486:
URL: https://github.com/apache/hudi/pull/6486#issuecomment-1238863109

   @xiarixiaoyao @yihua I found that the UTs do not cover 
InternalSchemaChangeApplier. I will add some tests later.





[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238863050

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   * d474c352d06287240f1b26c4bb31931eb216717c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


danny0405 commented on code in PR #6566:
URL: https://github.com/apache/hudi/pull/6566#discussion_r964343611


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/HoodieFlinkClusteringJob.java:
##
@@ -335,5 +391,17 @@ public void shutdownAsyncService(boolean error) {
 public void shutDown() {
   shutdownAsyncService(false);
 }
+
+/**
+ * Execute a dummy pipeline to prevent "no execute() calls" exceptions 
from being thrown if
+ * clustering is not performed.
+ */

Review Comment:
   Let's remove the dummy execution logic first to avoid confusion (a 
successful execution that does nothing). An exception thrown here does not 
affect anything, right? And the user will know that there is no compaction plan 
that needs to execute.






[GitHub] [hudi] danny0405 commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores

2022-09-06 Thread GitBox


danny0405 commented on code in PR #5113:
URL: https://github.com/apache/hudi/pull/5113#discussion_r964341113


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,226 @@
+
+
+# RFC-56: Federated Storage Layer
+
+## Proposers
+- @umehrot2
+
+## Approvers
+- @vinoth
+- @shivnarayan
+
+## Status
+
+JIRA: 
[https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)
+
+## Abstract
+
+As you scale your Apache Hudi workloads over Cloud object stores like Amazon 
S3, there is a potential of hitting request
+throttling limits, which in turn impacts performance. In this RFC, we are 
proposing to support an alternate storage
+layout that is optimized for Amazon S3 and other cloud object stores, which 
helps achieve maximum throughput and
+significantly reduce throttling.
+
+In addition, we are proposing an interface that would allow users to implement 
their own custom strategy to allow them
+to distribute the data files across cloud stores, HDFS, or on-prem storage based on 
their specific use-cases.
+
+## Background
+
+Apache Hudi follows the traditional Hive storage layout while writing files on 
storage:
+- Partitioned Tables: The files are distributed across multiple physical 
partition folders, under the table's base path.
+- Non Partitioned Tables: The files are stored directly under the table's base 
path.
+
+While this storage layout scales well for HDFS, it increases the probability 
of hitting request throttle limits when
+working with cloud object stores like Amazon S3 and others. This is because 
Amazon S3 and other cloud stores [throttle
+requests based on object 
prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Amazon S3 does scale based on request patterns for different prefixes and adds 
internal partitions (with their own request limits),
+but there can be a 30 - 60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the
+same table path prefix could result in these request limits being hit for the 
table prefix, specially as workloads
+scale, and there are several thousands of files being written/updated 
concurrently. This hurts performance due to
+re-trying of failed requests affecting throughput, and result in occasional 
failures if the retries are not able to
+succeed either and continue to be throttled.
+
+The traditional storage layout also tightly couples the partitions as folders 
under the table path. However,
+some users want flexibility to be able to distribute files/partitions under 
multiple different paths across cloud stores,
+hdfs etc. based on their specific needs. For example, customers have use cases 
to distribute files for each partition under
+a separate S3 bucket with its individual encryption key. It is not possible to 
implement such use-cases with Hudi currently.
+
+The high level proposal here is to introduce a new storage layout strategy, 
where all files are distributed evenly across
+multiple randomly generated prefixes under the Amazon S3 bucket, instead of 
being stored under a common table path/prefix.
+This would help distribute the requests evenly across different prefixes, 
resulting in Amazon S3 creating partitions for
+the prefixes each with its own request limit. This significantly reduces the 
possibility of hitting the request limit
+for a specific prefix/partition.
+
+In addition, we want to expose an interface that provides users the 
flexibility to implement their own strategy for
+distributing files if using the traditional Hive storage layout or federated 
storage layer (proposed in this RFC) does
+not meet their use-case.
+
+## Design
+
+### Interface
+
+```java
+/**
+ * Interface for providing storage file locations.
+ */
+public interface FederatedStorageStrategy extends Serializable {
+  /**
+   * Return a fully-qualified storage file location for the given filename.
+   *
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String fileName);
+
+  /**
+   * Return a fully-qualified storage file location for the given partition 
and filename.
+   *
+   * @param partitionPath partition path for the file
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String partitionPath, String fileName);
+}
+```
+
+### Generating file paths for Cloud storage optimized layout
+
+We want to distribute files evenly across multiple random prefixes, instead of 
following the traditional Hive storage
+layout of keeping them under a common table path/prefix. In addition to the 
`Table Path`, for this new layout the user will
+configure another `Table Storage Path` under which the actual data files will 
be distributed. The original `Table Path` will
+be used to maintain the table/partitions Hudi metadata.
+
+For the purpose of this documentation, let's assume:
+```
+Table Path => s3:
+
+Table St
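To make the quoted interface concrete, here is a minimal, hypothetical sketch of a hash-based random-prefix strategy in the spirit of the RFC; the class name, prefix count, and hashing scheme are assumptions for illustration, not part of the proposal:

```java
import java.io.Serializable;

/**
 * Hypothetical hash-based strategy in the spirit of the RFC's
 * FederatedStorageStrategy: files are spread across a fixed set of
 * prefixes under a storage base path, so S3 requests are distributed
 * over many internal partitions. The prefix count and hashing scheme
 * are illustrative assumptions, not part of the RFC.
 */
public class HashedPrefixStorageStrategy implements Serializable {
  private static final int NUM_PREFIXES = 16;
  private final String storageBasePath; // e.g. "s3://bucket/hudi-storage"

  public HashedPrefixStorageStrategy(String storageBasePath) {
    this.storageBasePath = storageBasePath;
  }

  /** Derive a stable prefix bucket from the name's hash. */
  private String prefixFor(String name) {
    int bucket = Math.floorMod(name.hashCode(), NUM_PREFIXES);
    return String.format("%04x", bucket); // e.g. "000a"
  }

  public String storageLocation(String fileName) {
    return storageBasePath + "/" + prefixFor(fileName) + "/" + fileName;
  }

  public String storageLocation(String partitionPath, String fileName) {
    // Hash on partition + file name so one partition's files still spread out.
    String key = partitionPath + "/" + fileName;
    return storageBasePath + "/" + prefixFor(key) + "/" + key;
  }

  public static void main(String[] args) {
    HashedPrefixStorageStrategy strategy =
        new HashedPrefixStorageStrategy("s3://bucket/hudi-storage");
    System.out.println(strategy.storageLocation("dt=2022-01-01", "file-1.parquet"));
  }
}
```

A real implementation would still need a way to enumerate files for listing; the metadata kept under the original `Table Path` would presumably serve that purpose.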

[GitHub] [hudi] danny0405 commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores

2022-09-06 Thread GitBox


danny0405 commented on code in PR #5113:
URL: https://github.com/apache/hudi/pull/5113#discussion_r964340481



[jira] [Updated] (HUDI-4794) add an option of the log file block size

2022-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4794:
-
Labels: pull-request-available  (was: )

> add an option of the log file block size
> 
>
> Key: HUDI-4794
> URL: https://issues.apache.org/jira/browse/HUDI-4794
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zhaoyangming
>Priority: Major
>  Labels: pull-request-available
>
> On each Hoodie log append, HDFS used will be increased by the length of the 
> block (512 MB), not the actual data length.
> Consider a scenario where I use many writers to append concurrently to a large 
> number of files (bucket files), but each time I append only 10 bytes.
> DFS used will be increased by the length of the block (512 MB), which will cause 
> the datanode to report insufficient disk space on data write.
> Even though this is related to HDFS, we should also have the option to modify the 
> configuration. It helps reduce the rate of increase of the reported disk usage (du).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
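The arithmetic behind the issue above can be sketched as follows, assuming each open log file reserves a full block of the configured size in the reported DFS usage; this is a simplification for illustration, not taken from HDFS or Hudi source:

```java
/**
 * Back-of-the-envelope sketch of the accounting described in HUDI-4794:
 * reported usage grows per file by the block size, not the bytes written.
 * The file count and the 64 MB alternative are assumed example values.
 */
public class DfsUsedEstimate {
  // Reported "dfs used" per new block follows the block size, not data length.
  static long reportedBytes(long numFiles, long blockSizeBytes) {
    return numFiles * blockSizeBytes;
  }

  public static void main(String[] args) {
    long files = 10_000;                        // many small bucket log files
    long defaultBlock = 512L * 1024 * 1024;     // default 512 MB threshold
    long smallerBlock = 64L * 1024 * 1024;      // hypothetical 64 MB option
    System.out.println(reportedBytes(files, defaultBlock)); // 5368709120000 (~4.9 TB)
    System.out.println(reportedBytes(files, smallerBlock)); // 671088640000 (~625 GB)
  }
}
```

Even though only ~100 KB of actual data is appended in this scenario, the reserved space differs by a factor of eight between the two block sizes, which is the motivation for making the block size configurable.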


[GitHub] [hudi] ymZhao1001 commented on a diff in pull request #6366: [HUDI-4794] add an option of the log file block size

2022-09-06 Thread GitBox


ymZhao1001 commented on code in PR #6366:
URL: https://github.com/apache/hudi/pull/6366#discussion_r964340073


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java:
##
@@ -230,8 +230,7 @@ private void rollOver() throws IOException {
   }
 
   private void createNewFile() throws IOException {
-this.output =
-fs.create(this.logFile.getPath(), false, bufferSize, replication, 
WriterBuilder.DEFAULT_SIZE_THRESHOLD, null);
+this.output = fs.create(this.logFile.getPath(), false, bufferSize, 
replication, HoodieCommonConfig.LOG_FILE_BLOCK_SIZE.defaultValue(), null);

Review Comment:
   Thanks, you are right. Modified.






[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4704:
--
Sprint: 2022/09/05

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> will delete the table and then recreate a table, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}





[jira] [Assigned] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4704:
-

Assignee: Raymond Xu

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> will delete the table and then recreate a table, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}





[jira] [Updated] (HUDI-4704) bulk insert overwrite table will delete the table and then recreate a table

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4704:
--
Fix Version/s: 0.12.1

> bulk insert overwrite table will delete the table and then recreate a table
> ---
>
> Key: HUDI-4704
> URL: https://issues.apache.org/jira/browse/HUDI-4704
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql, writer-core
>Affects Versions: 0.12.0
>Reporter: zouxxyy
>Priority: Major
> Fix For: 0.12.1
>
>
> When hoodie.sql.bulk.insert.enable is enabled, executing insert overwrite 
> will delete the table and then recreate a table, so time travel cannot be 
> performed.
>  
> {code:java}
> create table hudi_cow_test_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   'hoodie.sql.insert.mode' = 'non-strict',
>   'hoodie.sql.bulk.insert.enable' = 'true'
> );
> insert into hudi_cow_test_tbl select 1, 'a1', 1001, '2021-12-09', '11';
> insert overwrite hudi_cow_test_tbl select 3, 'a3', 1001, '2021-12-09', '11';
> {code}





[jira] [Updated] (HUDI-4716) Avoid bundle parquet in hadoop-mr

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4716:
--
Sprint: 2022/09/19

> Avoid bundle parquet in hadoop-mr
> -
>
> Key: HUDI-4716
> URL: https://issues.apache.org/jira/browse/HUDI-4716
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.13.0
>
>
> As per discussion in 
> https://github.com/apache/hudi/pull/5250#discussion_r930144788
> This will reduce the bundle size and uphold the principle of not bundling 
> file storage format.





[jira] [Closed] (HUDI-4720) HoodieInternalRow return wrong num of fields when source not contains meta fields

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4720.
-
  Assignee: sivabalan narayanan
Resolution: Fixed

> HoodieInternalRow return wrong num of fields when source not contains meta 
> fields
> -
>
> Key: HUDI-4720
> URL: https://issues.apache.org/jira/browse/HUDI-4720
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Frank Wong
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When the source row in HoodieInternalRow does not contain meta fields, 
> HoodieInternalRow#numFields returns the size of the source row. We expect it 
> to return the size of the source row plus the number of meta fields.



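The expected contract described in HUDI-4720 can be sketched as below; the class shape and the meta-field count of five (the `_hoodie_*` columns) are simplifications for illustration, not the actual Hudi implementation:

```java
/**
 * Illustrative sketch of the HUDI-4720 expectation: a row's total width
 * is the source width plus Hudi's meta columns, even when the source row
 * itself carries no meta fields. The meta-field count is an assumption.
 */
public class NumFieldsExample {
  // Hudi prepends meta columns such as _hoodie_commit_time,
  // _hoodie_record_key, _hoodie_partition_path, etc.
  static final int NUM_META_FIELDS = 5;

  // Expected: source width + meta fields, not just the source width.
  static int numFields(int sourceRowFields) {
    return sourceRowFields + NUM_META_FIELDS;
  }

  public static void main(String[] args) {
    System.out.println(numFields(3)); // a 3-column source row reports 8 fields
  }
}
```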


[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4722:
--
Sprint: 2022/09/05

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Added metrics for the following:
>  # Lock request latency
>  # Count of lock successes
>  # Count of failures to acquire the lock
>  # Duration of locks held, with support for re-entrancy
>  # Conflict resolution metrics: Success vs Failure





[jira] [Updated] (HUDI-4720) HoodieInternalRow return wrong num of fields when source not contains meta fields

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4720:
--
Fix Version/s: 0.12.1

> HoodieInternalRow return wrong num of fields when source not contains meta 
> fields
> -
>
> Key: HUDI-4720
> URL: https://issues.apache.org/jira/browse/HUDI-4720
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Frank Wong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When the source row in HoodieInternalRow does not contain meta fields, 
> HoodieInternalRow#numFields returns the size of the source row. We expect it 
> to return the size of the source row plus the number of meta fields.





[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4722:
--
   Reviewers: sivabalan narayanan
Story Points: 1

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Added metrics for the following:
>  # Lock request latency
>  # Count of lock successes
>  # Count of failures to acquire the lock
>  # Duration of locks held, with support for re-entrancy
>  # Conflict resolution metrics: Success vs Failure





[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4722:
--
Fix Version/s: 0.12.1

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Added metrics for the following:
>  # Lock request latency
>  # Count of lock successes
>  # Count of failures to acquire the lock
>  # Duration of locks held, with support for re-entrancy
>  # Conflict resolution metrics: Success vs Failure





[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4722:
--
Priority: Major  (was: Minor)

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Added metrics for the following:
>  # Lock request latency
>  # Count of lock successes
>  # Count of failures to acquire the lock
>  # Duration of locks held, with support for re-entrancy
>  # Conflict resolution metrics: Success vs Failure





[jira] [Updated] (HUDI-4724) add function of skip the _rt suffix for read snapshot

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4724:
--
Sprint: 2022/09/05

> add function of skip the _rt suffix for read snapshot
> -
>
> Key: HUDI-4724
> URL: https://issues.apache.org/jira/browse/HUDI-4724
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: linfey.nie
>Priority: Major
>  Labels: pull-request-available
>
> During Hive queries, we usually use the original table name when writing SQL. 
> Therefore, we need the ability to skip the _rt suffix so that snapshot reads 
> return the latest data for calculation.





[jira] [Updated] (HUDI-4724) add function of skip the _rt suffix for read snapshot

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4724:
--
   Reviewers: Raymond Xu
Story Points: 1

> add function of skip the _rt suffix for read snapshot
> -
>
> Key: HUDI-4724
> URL: https://issues.apache.org/jira/browse/HUDI-4724
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: linfey.nie
>Priority: Major
>  Labels: pull-request-available
>
> During Hive queries, we usually use the original table name when writing SQL. 
> Therefore, we need the ability to skip the _rt suffix so that snapshot reads 
> return the latest data for calculation.





[jira] [Updated] (HUDI-4734) Add table config change validation in deltastreamer

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4734:
--
Sprint: 2022/09/05

> Add table config change validation in deltastreamer
> ---
>
> Key: HUDI-4734
> URL: https://issues.apache.org/jira/browse/HUDI-4734
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: Vamshi Gudavarthi
>Priority: Major
> Fix For: 0.12.1
>
>
> looks like we are missing proper table config validation in deltastreamer. 
> {code:java}
>if (fs.exists(new Path(cfg.targetBasePath))) {
> HoodieTableMetaClient meta =
> HoodieTableMetaClient.builder().setConf(new 
> Configuration(fs.getConf())).setBasePath(cfg.targetBasePath).setLoadActiveTimelineOnLoad(false).build();
> tableType = meta.getTableType();
> // This will guarantee there is no surprise with table type
> 
> ValidationUtils.checkArgument(tableType.equals(HoodieTableType.valueOf(cfg.tableType)),
> "Hoodie table is of type " + tableType + " but passed in CLI 
> argument is " + cfg.tableType);
> // Load base file format
> // This will guarantee there is no surprise with base file type
> String baseFileFormat = 
> meta.getTableConfig().getBaseFileFormat().toString();
> 
> ValidationUtils.checkArgument(baseFileFormat.equals(cfg.baseFileFormat) || 
> cfg.baseFileFormat == null,
> "Hoodie table's base file format is of type " + baseFileFormat + 
> " but passed in CLI argument is "
> + cfg.baseFileFormat);
> cfg.baseFileFormat = baseFileFormat;
> this.cfg.baseFileFormat = baseFileFormat;
>   } else {
> tableType = HoodieTableType.valueOf(cfg.tableType);
> if (cfg.baseFileFormat == null) {
>   cfg.baseFileFormat = "PARQUET"; // default for backward 
> compatibility
> }
>   } {code}



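The validation quoted in HUDI-4734 compares table type and base file format one field at a time; a generic guard for any persisted-vs-CLI config pair could be sketched as below. The class and method names are illustrative assumptions, not the actual Hudi API:

```java
import java.util.Objects;

/**
 * Hypothetical helper for deltastreamer-style config validation: fail fast
 * when a CLI-provided value contradicts the value persisted in the table
 * config. Names are illustrative, not taken from the Hudi codebase.
 */
public class TableConfigGuard {
  static void checkUnchanged(String key, String persisted, String provided) {
    // A null CLI value means "inherit the persisted value", so it is allowed.
    if (provided != null && !Objects.equals(persisted, provided)) {
      throw new IllegalArgumentException(
          "Table config '" + key + "' is '" + persisted
              + "' but CLI argument is '" + provided + "'");
    }
  }

  public static void main(String[] args) {
    checkUnchanged("tableType", "MERGE_ON_READ", "MERGE_ON_READ"); // passes
    checkUnchanged("baseFileFormat", "PARQUET", null);             // passes
  }
}
```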


[jira] [Updated] (HUDI-4735) Spark2 bundles made from master after 2022-07-23 failed to stop

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4735:
--
Sprint: 2022/09/05

> Spark2 bundles made from master after 2022-07-23 failed to stop
> ---
>
> Key: HUDI-4735
> URL: https://issues.apache.org/jira/browse/HUDI-4735
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, spark, spark-sql
>Reporter: Vitali Obukh
>Assignee: Raymond Xu
>Priority: Major
>
> Daily Spark2 Structured streaming application built against the latest change 
> on master started failing on 2022-07-23 with the error:
> {quote}
> 22/07/28 10:52:50 ERROR org.apache.hudi.HoodieStreamingSink: Micro batch id=2 
> threw following exception: 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
> 'test' not found;
> {quote}
> Actually the database exists (created by Hudi). A Spark3 application with 
> the same code works fine.
> Hudi options used for writing:
> {code}
>  Map("hoodie.datasource.write.operation" -> "upsert",
>   "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
>   "hoodie.datasource.write.recordkey.field" -> s"$Meta.$Key",
>   "hoodie.datasource.write.precombine.field" -> s"$Meta.$Version",
>   "hoodie.datasource.write.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.compaction.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.datasource.write.partitionpath.field" -> TopLevelPartition,
>   "hoodie.datasource.write.hive_style_partitioning" -> "true",
>   "hoodie.datasource.hive_sync.enable" -> "true",
>   "hoodie.datasource.hive_sync.mode" -> "hms",
>   "hoodie.datasource.hive_sync.database" -> dbName(),
>   "hoodie.datasource.hive_sync.partition_fields" -> TopLevelPartition,
>   "hoodie.datasource.compaction.async.enable" -> "false"
> )
> {code}
> Cloud platform - Google Dataproc. 
> Spark version - 3.1.3.





[jira] [Updated] (HUDI-4735) Spark2 bundles made from master after 2022-07-23 failed to stop

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4735:
--
Fix Version/s: 0.12.1

> Spark2 bundles made from master after 2022-07-23 failed to stop
> ---
>
> Key: HUDI-4735
> URL: https://issues.apache.org/jira/browse/HUDI-4735
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive, spark, spark-sql
>Reporter: Vitali Obukh
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> Daily Spark2 Structured streaming application built against the latest change 
> on master started failing on 2022-07-23 with the error:
> {quote}
> 22/07/28 10:52:50 ERROR org.apache.hudi.HoodieStreamingSink: Micro batch id=2 
> threw following exception: 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 
> 'test' not found;
> {quote}
> Actually the database exists (created by Hudi). A Spark3 application with 
> the same code works fine.
> Hudi options used for writing:
> {code}
>  Map("hoodie.datasource.write.operation" -> "upsert",
>   "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
>   "hoodie.datasource.write.recordkey.field" -> s"$Meta.$Key",
>   "hoodie.datasource.write.precombine.field" -> s"$Meta.$Version",
>   "hoodie.datasource.write.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.compaction.payload.class" -> 
> "com.walmart.udps.hudi.integration.HudiTestAvroPayload",
>   "hoodie.datasource.write.partitionpath.field" -> TopLevelPartition,
>   "hoodie.datasource.write.hive_style_partitioning" -> "true",
>   "hoodie.datasource.hive_sync.enable" -> "true",
>   "hoodie.datasource.hive_sync.mode" -> "hms",
>   "hoodie.datasource.hive_sync.database" -> dbName(),
>   "hoodie.datasource.hive_sync.partition_fields" -> TopLevelPartition,
>   "hoodie.datasource.compaction.async.enable" -> "false"
> )
> {code}
> Cloud platform - Google Dataproc. 
> Spark version - 3.1.3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] Aload opened a new issue, #6618: Caused by: org.apache.http.NoHttpResponseException: xxxxxx:34812 failed to respond[SUPPORT]

2022-09-06 Thread GitBox


Aload opened a new issue, #6618:
URL: https://github.com/apache/hudi/issues/6618

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   After the program has been running for a while, the following problem 
repeatedly occurs.
   
![image](https://user-images.githubusercontent.com/13082598/188775813-2fdc6ab0-189f-4a36-9aa7-50957e9f594b.png)
   
   
   **To Reproduce**
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.12.0
   
   * Spark version :3.2.1
   
   * Hive version :2.3.7
   
   * Hadoop version :3.0.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```2022-09-07 06:54:16,271 ERROR 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView [] - Got error 
running preferred function. Trying secondary
   org.apache.hudi.exception.HoodieRemoteException: 10.0.20.51:34812 failed to 
respond
at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:438)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:68)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getPendingCompactionOperations(PriorityBasedFileSystemView.java:224)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.scheduleCompaction(ScheduleCompactionActionExecutor.java:117)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:93)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.table.HoodieFlinkMergeOnReadTable.scheduleCompaction(HoodieFlinkMergeOnReadTable.java:109)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableServiceInternal(BaseHoodieWriteClient.java:1353)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableService(BaseHoodieWriteClient.java:1330)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompactionAtInstant(BaseHoodieWriteClient.java:1009)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompaction(BaseHoodieWriteClient.java:1000)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.util.CompactionUtil.scheduleCompaction(CompactionUtil.java:65) 
~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:246)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
 ~[hudi-flink1.14-bundle-0.12.0.jar:0.12.0]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_181]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
   Caused by: org.apache.http.NoHttpResponseException: 10.0.20.51:34812 failed 
to respond
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167) 
~[anso-process-0.0.1.jar:?]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
 ~[anso-process-0.0.1.jar:?]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271) 
~[anso-pr

[jira] [Updated] (HUDI-4762) Hive sync update schema removes columns

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4762:
--
Sprint: 2022/09/05

> Hive sync update schema removes columns 
> 
>
> Key: HUDI-4762
> URL: https://issues.apache.org/jira/browse/HUDI-4762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when we move a hudi table from schema1 to schema2 and then insert 
> data with the old schema1, schema2 is kept for the whole table.
> This is not consistent with the Hive metastore, which gets its schema updated 
> to the old schema1.
> Avoid updating the metastore schema if the input is only missing columns.
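The proposed guard can be sketched as a hypothetical standalone helper (this is not Hudi's actual meta-sync API; the class and method names here are made up for illustration): the metastore schema is updated only when the incoming schema adds a column or changes a type, never when it merely misses columns.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaSyncCheck {
  // Hypothetical helper mirroring the proposed fix: skip the metastore
  // schema update when the incoming schema is only missing columns.
  static boolean shouldUpdateMetastoreSchema(Map<String, String> stored,
                                             Map<String, String> incoming) {
    for (Map.Entry<String, String> field : incoming.entrySet()) {
      String storedType = stored.get(field.getKey());
      // A new column or a changed type still requires an update.
      if (storedType == null || !storedType.equals(field.getValue())) {
        return true;
      }
    }
    // Incoming schema is a subset of the stored one: keep the wider schema.
    return false;
  }

  public static void main(String[] args) {
    Map<String, String> schema2 = new HashMap<>();
    schema2.put("id", "bigint");
    schema2.put("name", "string");
    schema2.put("added_col", "string");

    Map<String, String> schema1 = new HashMap<>(schema2);
    schema1.remove("added_col");

    // Old schema1 data into a schema2 table: metastore keeps schema2.
    System.out.println(shouldUpdateMetastoreSchema(schema2, schema1)); // false
    // schema2 data into a schema1 table: an update is needed.
    System.out.println(shouldUpdateMetastoreSchema(schema1, schema2)); // true
  }
}
```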



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4762) Hive sync update schema removes columns

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4762:
--
Component/s: meta-sync

> Hive sync update schema removes columns 
> 
>
> Key: HUDI-4762
> URL: https://issues.apache.org/jira/browse/HUDI-4762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when we move a hudi table from schema1 to schema2 and then insert 
> data with the old schema1, schema2 is kept for the whole table.
> This is not consistent with the Hive metastore, which gets its schema updated 
> to the old schema1.
> Avoid updating the metastore schema if the input is only missing columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4762) Hive sync update schema removes columns

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4762:
--
   Reviewers: Raymond Xu
Story Points: 1

> Hive sync update schema removes columns 
> 
>
> Key: HUDI-4762
> URL: https://issues.apache.org/jira/browse/HUDI-4762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when we move a hudi table from schema1 to schema2 and then insert 
> data with the old schema1, schema2 is kept for the whole table.
> This is not consistent with the Hive metastore, which gets its schema updated 
> to the old schema1.
> Avoid updating the metastore schema if the input is only missing columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4762) Hive sync update schema removes columns

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4762:
--
Priority: Critical  (was: Major)

> Hive sync update schema removes columns 
> 
>
> Key: HUDI-4762
> URL: https://issues.apache.org/jira/browse/HUDI-4762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: nicolas paris
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when we move a hudi table from schema1 to schema2 and then insert 
> data with the old schema1, schema2 is kept for the whole table.
> This is not consistent with the Hive metastore, which gets its schema updated 
> to the old schema1.
> Avoid updating the metastore schema if the input is only missing columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4762) Hive sync update schema removes columns

2022-09-06 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4762:
--
Fix Version/s: 0.12.1

> Hive sync update schema removes columns 
> 
>
> Key: HUDI-4762
> URL: https://issues.apache.org/jira/browse/HUDI-4762
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: nicolas paris
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when we move a hudi table from schema1 to schema2 and then insert 
> data with the old schema1, schema2 is kept for the whole table.
> This is not consistent with the Hive metastore, which gets its schema updated 
> to the old schema1.
> Avoid updating the metastore schema if the input is only missing columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #5716: [HUDI-4167] Remove the timeline refresh with initializing hoodie table

2022-09-06 Thread GitBox


danny0405 commented on code in PR #5716:
URL: https://github.com/apache/hudi/pull/5716#discussion_r931721763


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -567,6 +564,16 @@ private synchronized void close(Pair 
partitionFileSlicePair) {
 closeReader(readers);
   }
 
+  /**
+   * Close and clear all the partitions readers.
+   */
+  private void closePartitionReaders() {

Review Comment:
   No, it is related: in this patch we refresh the metadata table when the fs 
view is refreshed, and there is some state in the readers that we need to 
clean.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573

2022-09-06 Thread GitBox


hudi-bot commented on PR #6574:
URL: https://github.com/apache/hudi/pull/6574#issuecomment-1238837321

   
   ## CI report:
   
   * 277061fa910ff388b9fa580083fd3af406ce3b94 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11160)
 
   * 86efca51356b9d673dc971b23e8280872c62ba41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11206)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6566: [HUDI-4766] Fix HoodieFlinkClusteringJob

2022-09-06 Thread GitBox


hudi-bot commented on PR #6566:
URL: https://github.com/apache/hudi/pull/6566#issuecomment-1238837297

   
   ## CI report:
   
   * b10c9d062f03c2c2675866c6f4bf6346dc03ea49 UNKNOWN
   * a2dcd81f74603e88c4db895900d43eee6702a6da UNKNOWN
   * c404647afc6d26bc0e69a7a8ef93f378b397bb96 UNKNOWN
   * f54c41a18ee1070249aafd9c81830699b360d0b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11154)
 
   * 38b7cfffeb455c6745b7d23445e48292567e7b9a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] eric9204 commented on a diff in pull request #6574: Keep a clustering running at the same time.#6573

2022-09-06 Thread GitBox


eric9204 commented on code in PR #6574:
URL: https://github.com/apache/hudi/pull/6574#discussion_r963258228


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanActionExecutor.java:
##
@@ -77,11 +78,14 @@ protected Option 
createClusteringPlan() {
   return Option.empty();
 }
 
-LOG.info("Generating clustering plan for table " + config.getBasePath());
-ClusteringPlanStrategy strategy = (ClusteringPlanStrategy)
-
ReflectionUtils.loadClass(ClusteringPlanStrategy.checkAndGetClusteringPlanStrategy(config),
 table, context, config);
+ClusteringPlanStrategy strategy = null;
+if (config.getAsyncClusterMaxCommits() <= commitsSinceLastClustering) {
+  LOG.info("Generating clustering plan for table " + config.getBasePath());
+  strategy = (ClusteringPlanStrategy)
+  
ReflectionUtils.loadClass(ClusteringPlanStrategy.checkAndGetClusteringPlanStrategy(config),
 table, context, config);
+}
 
-return strategy.generateClusteringPlan();
+return strategy == null ? Option.empty() : 
strategy.generateClusteringPlan();

Review Comment:
   @yihua yes, because my last commit deleted the following condition, which 
avoids frequent clustering scheduling.
   ```
   if (table.getActiveTimeline().filterPendingReplaceTimeline().countInstants() 
!= 0) {
 LOG.info("The last clustering is running,there is no need to generate 
a new clustering plan" + config.getBasePath());
 return Option.empty();
   }
   ```
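The two conditions under discussion can be sketched as a hypothetical standalone predicate (illustrative only, not the actual `ClusteringPlanActionExecutor` code): clustering is skipped while a replacecommit is still pending, and otherwise scheduled only once enough delta commits have accumulated.

```java
public class ClusteringGuard {
  // Hypothetical predicate combining the two guard conditions.
  static boolean shouldScheduleClustering(int pendingReplaceInstants,
                                          int commitsSinceLastClustering,
                                          int deltaCommitsThreshold) {
    // Condition 1: a previous clustering (replacecommit) is still pending.
    if (pendingReplaceInstants != 0) {
      return false;
    }
    // Condition 2: enough commits have accumulated since the last clustering.
    return commitsSinceLastClustering >= deltaCommitsThreshold;
  }

  public static void main(String[] args) {
    // With 'clustering.delta_commits' = 6:
    System.out.println(shouldScheduleClustering(1, 10, 6)); // false: one still running
    System.out.println(shouldScheduleClustering(0, 3, 6));  // false: too few commits
    System.out.println(shouldScheduleClustering(0, 6, 6));  // true
  }
}
```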



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/cluster/ClusteringPlanActionExecutor.java:
##
@@ -63,6 +63,7 @@ protected Option createClusteringPlan() 
{
 int commitsSinceLastClustering = 
table.getActiveTimeline().getCommitsTimeline().filterCompletedInstants()
 
.findInstantsAfter(lastClusteringInstant.map(HoodieInstant::getTimestamp).orElse("0"),
 Integer.MAX_VALUE)
 .countInstants();
+
 if (config.inlineClusteringEnabled() && 
config.getInlineClusterMaxCommits() > commitsSinceLastClustering) {

Review Comment:
   @yihua yes, this is indeed a redundant check; I'm testing whether this 
condition is needed.
   
   Adding these two conditions guarantees that only one clustering runs at a 
time, and if the previous clustering has not completed, no new clustering 
plan will be generated.
   
   
   Configure only these three parameters. 
   
   'clustering.schedule.enabled'='true',
   'clustering.async.enabled'='false',
   'clustering.delta_commits'='6',
   
   ```
   0 2022-09-02 10:38 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/.aux
   0 2022-09-02 10:38 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/.schema
   0 2022-09-02 10:41 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/.temp
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103807454.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103807454.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103807454.inflight
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103813399.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103813399.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103813399.inflight
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103823232.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103823232.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103823232.inflight
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103833587.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103833587.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103833587.inflight
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103842538.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103842538.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103842538.inflight
   2.6 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103856152.commit
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103856152.commit.requested
   0 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103856152.inflight
   3.8 K 2022-09-02 10:39 
/tmp/hudi/insert_cow_clustering_12_state_1/.hoodie/20220902103902693.replacecommit.requested
   2.6 K 2022-09-02 10:40 
/tmp/hudi/insert_cow_clustering_12_state_

[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573

2022-09-06 Thread GitBox


hudi-bot commented on PR #6574:
URL: https://github.com/apache/hudi/pull/6574#issuecomment-1238834637

   
   ## CI report:
   
   * 277061fa910ff388b9fa580083fd3af406ce3b94 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11160)
 
   * 86efca51356b9d673dc971b23e8280872c62ba41 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6607: [HUDI-4782] Support TIMESTAMP_LTZ type for flink

2022-09-06 Thread GitBox


hudi-bot commented on PR #6607:
URL: https://github.com/apache/hudi/pull/6607#issuecomment-1238831881

   
   ## CI report:
   
   * e05038ec2798a39ce3ab7bcbdbcf9c7e009c8188 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11180)
 
   * 226953816abfedf8df546374fa35290950ab757c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11205)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 closed pull request #2419: [HUDI-1421] Improvement of failure recovery for HoodieFlinkStreamer.

2022-09-06 Thread GitBox


danny0405 closed pull request #2419: [HUDI-1421] Improvement of failure 
recovery for HoodieFlinkStreamer.
URL: https://github.com/apache/hudi/pull/2419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 closed issue #6540: [SUPPORT]KryoException when bulk insert into hudi with flink

2022-09-06 Thread GitBox


danny0405 closed issue #6540: [SUPPORT]KryoException when bulk insert into hudi 
with flink
URL: https://github.com/apache/hudi/issues/6540


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #6540: [SUPPORT]KryoException when bulk insert into hudi with flink

2022-09-06 Thread GitBox


danny0405 commented on issue #6540:
URL: https://github.com/apache/hudi/issues/6540#issuecomment-1238828954

   Thanks, the problem is expected to be fixed by #6571; feel free to reopen 
this issue if the problem still exists.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6607: [HUDI-4782] Support TIMESTAMP_LTZ type for flink

2022-09-06 Thread GitBox


hudi-bot commented on PR #6607:
URL: https://github.com/apache/hudi/pull/6607#issuecomment-1238828680

   
   ## CI report:
   
   * e05038ec2798a39ce3ab7bcbdbcf9c7e009c8188 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11180)
 
   * 226953816abfedf8df546374fa35290950ab757c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-4795) Fix KryoException when bulk insert into a not bucket index hudi table

2022-09-06 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4795.
--

> Fix KryoException when bulk insert into a not bucket index hudi table
> -
>
> Key: HUDI-4795
> URL: https://issues.apache.org/jira/browse/HUDI-4795
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4795) Fix KryoException when bulk insert into a not bucket index hudi table

2022-09-06 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601102#comment-17601102
 ] 

Danny Chen commented on HUDI-4795:
--

Fixed via master branch: 27c7efb4efc380360af7a18fc57c0757f852390f

> Fix KryoException when bulk insert into a not bucket index hudi table
> -
>
> Key: HUDI-4795
> URL: https://issues.apache.org/jira/browse/HUDI-4795
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (323f19685c -> 27c7efb4ef)

2022-09-06 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 323f19685c [HUDI-4776] Fix merge into use unresolved assignment (#6589)
 add 27c7efb4ef [HUDI-4795] Fix KryoException when bulk insert into a not 
bucket index hudi table

No new revisions were added by this update.

Summary of changes:
 .../hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[jira] [Updated] (HUDI-4795) Fix KryoException when bulk insert into a not bucket index hudi table

2022-09-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4795:
-
Labels: pull-request-available  (was: )

> Fix KryoException when bulk insert into a not bucket index hudi table
> -
>
> Key: HUDI-4795
> URL: https://issues.apache.org/jira/browse/HUDI-4795
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4795) Fix KryoException when bulk insert into a not bucket index hudi table

2022-09-06 Thread Danny Chen (Jira)
Danny Chen created HUDI-4795:


 Summary: Fix KryoException when bulk insert into a not bucket 
index hudi table
 Key: HUDI-4795
 URL: https://issues.apache.org/jira/browse/HUDI-4795
 Project: Apache Hudi
  Issue Type: Bug
  Components: flink
Reporter: Danny Chen
 Fix For: 0.12.1






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 merged pull request #6571: [HUDI-4795] Fix KryoException when bulk insert into a not bucket index hudi table

2022-09-06 Thread GitBox


danny0405 merged PR #6571:
URL: https://github.com/apache/hudi/pull/6571


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #6612: [HUDI-4790] a more effective HoodieMergeHandler for COW table with parquet

2022-09-06 Thread GitBox


danny0405 commented on PR #6612:
URL: https://github.com/apache/hudi/pull/6612#issuecomment-1238824801

   Overall an interesting idea; let's put the details in the document.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-09-06 Thread GitBox


paul8263 commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1238819752

   Hi @codope and @yihua ,
   The hudi-integ-test errors are almost all cleared. The only one left is:
   
   
org.apache.hudi.integ.command.ITTestHoodieSyncCommand.testValidateSync(ITTestHoodieSyncCommand.java:56)
   
   
[https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11183&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a&l=146906](https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11183&view=logs&j=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de&t=30b5aae4-0ea0-5566-42d0-febf71a7061a&l=146906)
   
   Is there a way to view the detailed error log in the docker container via 
Azure?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hbgstc123 commented on issue #6540: [SUPPORT]KryoException when bulk insert into hudi with flink

2022-09-06 Thread GitBox


hbgstc123 commented on issue #6540:
URL: https://github.com/apache/hudi/issues/6540#issuecomment-1238816669

   > Do you use streaming mode? When I use streaming mode and use a bounded 
   > source, this error is reproduced stably.
   
   Both streaming mode and batch mode can reproduce it stably.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua closed pull request #3010: Improving Hudi CLI tool docs

2022-09-06 Thread GitBox


yihua closed pull request #3010: Improving Hudi CLI tool docs
URL: https://github.com/apache/hudi/pull/3010


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #3010: Improving Hudi CLI tool docs

2022-09-06 Thread GitBox


yihua commented on PR #3010:
URL: https://github.com/apache/hudi/pull/3010#issuecomment-1238815146

   Closing this as the current [OSS Hudi CLI 
guide](https://hudi.apache.org/docs/cli) and [EMR Hudi CLI 
guide](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-cli.html) 
are comprehensive.  @pratyakshsharma feel free to open a new PR to add 
information based on the latest docs if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #2963: [HUDI-1904] Introduce SchemaProviderInterface to make SchemaProvider unified

2022-09-06 Thread GitBox


yihua commented on PR #2963:
URL: https://github.com/apache/hudi/pull/2963#issuecomment-1238811686

   @wangxianghu do we still need this or can we close it now, given schema on 
read / evolution is supported in Spark?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4794) add an option of the log file block size

2022-09-06 Thread zhaoyangming (Jira)
zhaoyangming created HUDI-4794:
--

 Summary: add an option of the log file block size
 Key: HUDI-4794
 URL: https://issues.apache.org/jira/browse/HUDI-4794
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: zhaoyangming


On each hoodie log append, HDFS used space is increased by the length of the 
block (512M), not by the actual data length.
Consider a scenario where I use many writers to append concurrently to a large 
number of files (bucket files), but each time I append only 10 bytes.
DFS used space is increased by the length of the block (512M), which will 
cause the datanode to report insufficient disk space on data write.
Even though this is related to HDFS, we should also have the option to modify 
the configuration. It helps reduce the rate of increase reported by du.
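The space-accounting effect described above can be illustrated with a back-of-the-envelope sketch. The numbers below are hypothetical except the 512 MB figure, which is the block size quoted in the report:

```java
public class HdfsBlockReservation {
  // HDFS "used" space grows by a full block per open log file,
  // not by the number of bytes actually appended.
  static long reservedBytes(int openLogFiles, long blockSizeBytes) {
    return openLogFiles * blockSizeBytes;
  }

  public static void main(String[] args) {
    long blockSize = 512L * 1024 * 1024; // 512 MB, as in the report
    int writers = 1000;                  // bucket files appended concurrently
    long actualData = writers * 10L;     // only 10 bytes appended per file

    // 10 KB of real data can show up as ~500 GiB of reported usage.
    System.out.println("actual data:   " + actualData + " bytes");
    System.out.println("reported used: "
        + (reservedBytes(writers, blockSize) >> 30) + " GiB");
  }
}
```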



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

