[GitHub] [hudi] danny0405 opened a new pull request, #9177: [HUDI-6526] Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive
danny0405 opened a new pull request, #9177: URL: https://github.com/apache/hudi/pull/9177 ### Change Logs After HUDI-6307, timestamp(3) is also synced as timestamp type in Hive. ### Impact The table created through hive catalog would sync the timestamp(3) as timestamp in Hive. ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6526) Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive
Danny Chen created HUDI-6526: Summary: Hoodie Hive catalog sync timestamp(3) as timestamp type into Hive Key: HUDI-6526 URL: https://issues.apache.org/jira/browse/HUDI-6526 Project: Apache Hudi Issue Type: Bug Components: flink-sql Reporter: Danny Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason closed HUDI-6525. Resolution: Invalid > Syncing the view in Embedded TimelineServer can cause NullPointerException if > MDT is enabled > > > Key: HUDI-6525 > URL: https://issues.apache.org/jira/browse/HUDI-6525 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: 0.14.0 > Fix For: 0.14.0 > > > Assume Embedded TimelineServer is being used and there are many executors > sending requests. > Executor 1: Sends request which causes TimelineServer to sync its view > syncIfLocalViewBehind() -> view.sync(); > If the view is HoodieMetadataFileSystemView, its sync is written as follows: > > public void sync() { > super.sync(); // -> REFRESHES TIMELINE ON TimelineServer > tableMetadata.reset(); // CLOSES MDT readers and OPENS them again > } > > The issue is that once super.sync() completes (almost immediately), the > requests from other executors wont detect the local view to be behind and > hence would not call syncIfLocalViewBehind(). They would directly start > calling functions on the view (e.g. view.getLatestFileSlices()). But until > MDT is read (log merging may take time when there are large number of log > blocks), view.getLatestFileSlices() will lead to NPE. > > The fix is to reverse the order of reset and sync in > HoodieMetadataFileSystemView.
[jira] [Commented] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
[ https://issues.apache.org/jira/browse/HUDI-6525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742289#comment-17742289 ] Prashant Wason commented on HUDI-6525: -- This is not an issue for the master branch due to the read/write lock implemented in https://github.com/apache/hudi/pull/8079. I detected this issue in the 0.10 build. > Syncing the view in Embedded TimelineServer can cause NullPointerException if > MDT is enabled > > > Key: HUDI-6525 > URL: https://issues.apache.org/jira/browse/HUDI-6525 > Project: Apache Hudi > Issue Type: Bug >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Major > Labels: 0.14.0 > Fix For: 0.14.0 > > > Assume Embedded TimelineServer is being used and there are many executors > sending requests. > Executor 1: Sends request which causes TimelineServer to sync its view > syncIfLocalViewBehind() -> view.sync(); > If the view is HoodieMetadataFileSystemView, its sync is written as follows: > > public void sync() { > super.sync(); // -> REFRESHES TIMELINE ON TimelineServer > tableMetadata.reset(); // CLOSES MDT readers and OPENS them again > } > > The issue is that once super.sync() completes (almost immediately), the > requests from other executors wont detect the local view to be behind and > hence would not call syncIfLocalViewBehind(). They would directly start > calling functions on the view (e.g. view.getLatestFileSlices()). But until > MDT is read (log merging may take time when there are large number of log > blocks), view.getLatestFileSlices() will lead to NPE. > > The fix is to reverse the order of reset and sync in > HoodieMetadataFileSystemView.
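The read/write-lock approach referenced for the master branch (https://github.com/apache/hudi/pull/8079) can be sketched roughly as follows; the class and field names are illustrative assumptions, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch of the guard pattern: view readers take the read lock,
// while sync() holds the write lock across *both* steps, so no request can
// observe a refreshed timeline paired with closed MDT readers.
class GuardedView {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> fileSlices = new ArrayList<>(); // stand-in for view state

    void sync() {
        lock.writeLock().lock();
        try {
            // stand-ins for super.sync() and tableMetadata.reset(),
            // both completed before any reader can proceed
            fileSlices = new ArrayList<>();
            fileSlices.add("file-slice-1");
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> getLatestFileSlices() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(fileSlices);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

A reader calling `getLatestFileSlices()` during `sync()` simply blocks until the write lock is released, which is why the ordering bug cannot surface on master.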
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631919853 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500) * 16ae34ec0e91811bae11a980749f5b77d048adba Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18519) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Created] (HUDI-6525) Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled
Prashant Wason created HUDI-6525: Summary: Syncing the view in Embedded TimelineServer can cause NullPointerException if MDT is enabled Key: HUDI-6525 URL: https://issues.apache.org/jira/browse/HUDI-6525 Project: Apache Hudi Issue Type: Bug Reporter: Prashant Wason Assignee: Prashant Wason Fix For: 0.14.0 Assume Embedded TimelineServer is being used and there are many executors sending requests. Executor 1: Sends a request which causes the TimelineServer to sync its view: syncIfLocalViewBehind() -> view.sync(); If the view is HoodieMetadataFileSystemView, its sync is written as follows: public void sync() { super.sync(); // -> REFRESHES TIMELINE ON TimelineServer tableMetadata.reset(); // CLOSES MDT readers and OPENS them again } The issue is that once super.sync() completes (almost immediately), the requests from other executors won't detect the local view to be behind and hence would not call syncIfLocalViewBehind(). They would directly start calling functions on the view (e.g. view.getLatestFileSlices()). But until the MDT is read (log merging may take time when there are a large number of log blocks), view.getLatestFileSlices() will lead to an NPE. The fix is to reverse the order of reset and sync in HoodieMetadataFileSystemView.
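The ordering fix described in the issue reduces to a small invariant: the MDT reader reset must complete before the refreshed timeline becomes visible to other executors. A minimal, self-contained sketch of that invariant (illustrative names, not the actual Hudi classes):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of the ordering fix: readersOpen stands in for
// tableMetadata.reset() having finished, timelineFresh for super.sync()
// having published a refreshed timeline.
class ViewSync {
    final AtomicBoolean timelineFresh = new AtomicBoolean(false);
    final AtomicBoolean readersOpen = new AtomicBoolean(false);

    // Fixed order: reopen the MDT readers first, then mark the timeline fresh.
    void sync() {
        readersOpen.set(true);
        timelineFresh.set(true);
    }

    // Invariant a concurrent executor relies on: once the timeline looks
    // fresh (so it skips syncIfLocalViewBehind), the readers must be open.
    boolean invariantHolds() {
        return !timelineFresh.get() || readersOpen.get();
    }
}
```

With the original order (timeline first, readers second) there is a window where `timelineFresh` is true while `readersOpen` is false, which is exactly the window in which `view.getLatestFileSlices()` hits the NPE.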
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631910270 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500) * 16ae34ec0e91811bae11a980749f5b77d048adba UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1631909019 ## CI report: * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN * 8429f89c5499d33074a6ed0039e5bd9a2da84024 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18483) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18518) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] boneanxs commented on pull request #8452: [HUDI-6077] Add more partition push down filters
boneanxs commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1631902798 @hudi-bot run azure
[GitHub] [hudi] codope merged pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
codope merged PR #9168: URL: https://github.com/apache/hudi/pull/9168
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1631868486 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cb101756f1bb906839b8f135b618f26205e022a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18462) * 8ea8e3a853152fcc0d1cf4d0f53a38565ac33990 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18515) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631868189 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) * a0e53d8ef3f506afca5f25fd22adb70581fae426 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18516) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
SteNicholas commented on code in PR #9160: URL: https://github.com/apache/hudi/pull/9160#discussion_r1260602587

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:

@@ -185,36 +179,55 @@ public void start(String instantTime) {
   }

   /**
-   * Stops the heartbeat for the specified instant.
-   * @param instantTime
+   * Stops the heartbeat and deletes the heartbeat file for the specified instant.
+   *
+   * @param instantTime The instant time for the heartbeat.
    * @throws HoodieException
    */
   public void stop(String instantTime) throws HoodieException {
     Heartbeat heartbeat = instantToHeartbeatMap.get(instantTime);
-    if (heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped()) {
-      LOG.info("Stopping heartbeat for instant " + instantTime);
-      heartbeat.getTimer().cancel();
-      heartbeat.setHeartbeatStopped(true);
-      LOG.info("Stopped heartbeat for instant " + instantTime);
+    if (isHeartbeatStarted(heartbeat)) {
+      stopHeartbeatTimer(heartbeat);
       HeartbeatUtils.deleteHeartbeatFile(fs, basePath, instantTime);
       LOG.info("Deleted heartbeat file for instant " + instantTime);
     }
   }

   /**
-   * Stops all heartbeats started via this instance of the client.
+   * Stops all timers of heartbeats started via this instance of the client.
+   *
    * @throws HoodieException
    */
-  public void stop() throws HoodieException {
-    instantToHeartbeatMap.values().forEach(heartbeat -> stop(heartbeat.getInstantTime()));
+  public void stopHeartbeatTimers() throws HoodieException {
+    instantToHeartbeatMap.values().stream().filter(this::isHeartbeatStarted).forEach(this::stopHeartbeatTimer);
+  }
+
+  /**
+   * Whether the given heartbeat is started.
+   *
+   * @param heartbeat The heartbeat to check whether is started.
+   * @return Whether the heartbeat is started.
+   * @throws IOException
+   */
+  private boolean isHeartbeatStarted(Heartbeat heartbeat) {
+    return heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped();

Review Comment: @leesf, checking `heartbeat != null && heartbeat.isHeartbeatStarted()` or `heartbeat != null && !heartbeat.isHeartbeatStopped()` alone is not enough, because once a heartbeat is stopped both `isHeartbeatStarted` and `isHeartbeatStopped` are true.
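The flag semantics under discussion can be modeled in a few lines; this is a hedged sketch with illustrative names, not the actual Hudi `Heartbeat` class:

```java
// Minimal model of the heartbeat lifecycle flags.
class HeartbeatModel {
    private boolean started;
    private boolean stopped;

    void start() { started = true; }

    // Note: started deliberately stays true after stop(), mirroring the
    // behavior described in the review comment above.
    void stop() { stopped = true; }

    // Either flag alone misclassifies a stopped heartbeat; a heartbeat is
    // active only if it was started and has not been stopped since.
    boolean isActive() { return started && !stopped; }
}
```

This is why the `started && !stopped` conjunction (plus the null check in the real code) is required rather than either flag on its own.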
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1631862722 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cb101756f1bb906839b8f135b618f26205e022a9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18462) * 8ea8e3a853152fcc0d1cf4d0f53a38565ac33990 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631862350 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) * a0e53d8ef3f506afca5f25fd22adb70581fae426 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631856861 ## CI report: * 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18506) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631856765 ## CI report: * c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18505) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream&lt;HoodieInstant&gt; getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option&lt;HoodieInstant&gt; lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream&lt;HoodieInstant&gt; getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To findout which instants are conflicting, we apply the following logic +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List&lt;HoodieInstant&gt; completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write.
+ */ + private Stream&lt;HoodieInstant&gt; getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream&lt;HoodieInstant&gt; completedCommitsStream = +activeT
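The candidate-selection rule in the strategy above (table-service writers also treat inflight ingestion commits as conflict candidates, while other writers only look at commits completed after their own instant) can be sketched as follows; the types and the timestamp-based ordering are simplifying assumptions, not the Hudi API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged, self-contained model of the selection rule; fields are
// illustrative stand-ins for HoodieInstant state and action.
class InstantModel {
    final String timestamp;
    final boolean completed;
    final boolean ingestion; // true for commit/deltacommit writers

    InstantModel(String timestamp, boolean completed, boolean ingestion) {
        this.timestamp = timestamp;
        this.completed = completed;
        this.ingestion = ingestion;
    }
}

class CandidateSelector {
    // A completed later instant always conflicts; an inflight ingestion
    // instant conflicts only when the current writer is a table service
    // (clustering/compaction), giving ingestion writers priority.
    static List<InstantModel> candidates(List<InstantModel> timeline,
                                         String currentTs,
                                         boolean currentIsTableService) {
        return timeline.stream()
            .filter(i -> (i.completed && i.timestamp.compareTo(currentTs) > 0)
                      || (currentIsTableService && !i.completed && i.ingestion))
            .collect(Collectors.toList());
    }
}
```

Under this rule a clustering or compaction plan will abort when it overlaps an in-progress ingestion write, instead of forcing the ingestion writer to retry.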
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To findout which instants are conflicting, we apply the following logic +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write. 
+ */ + private Stream getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream completedCommitsStream = +activeT
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260558466

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java:
## @@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.transaction;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.ClusteringUtils;
+import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.common.util.Option;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION;
+import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION;
+
+/**
+ * This class extends the base implementation of conflict resolution strategy.
+ * It gives preference to ingestion writers compared to table services.
+ */
+public class IngestionPrimaryWriterBasedConflictResolutionStrategy
+    extends SimpleConcurrentFileWritesConflictResolutionStrategy {
+
+  private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class);
+
+  /**
+   * For table services like replacecommit and compaction commits, this method also returns ingestion inflight commits.
+   */
+  @Override
+  public Stream<HoodieInstant> getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant,
+      Option<HoodieInstant> lastSuccessfulInstant) {
+    HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline();
+    if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction())
+        && ClusteringUtils.isClusteringCommit(metaClient, currentInstant))
+        || COMPACTION_ACTION.equals(currentInstant.getAction())) {
+      return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant);
+    } else {
+      return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant);
+    }
+  }
+
+  private Stream<HoodieInstant> getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) {
+    // To find out which instants are conflicting, we apply the following logic:
+    // get the completed-instants timeline, but only for commits that have happened
+    // since the last successful write, based on their state transition times.
+    // We need to check for write conflicts since they may have mutated the same files
+    // that are being newly created by the current write.
+    List<HoodieInstant> completedCommitsInstants = activeTimeline
+        .getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION))
+        .filterCompletedInstants()
+        .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp())
+        .getInstantsOrderedByStateTransitionTime()
+        .collect(Collectors.toList());
+    LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants));
+    return completedCommitsInstants.stream();
+  }
+
+  /**
+   * To find which instants are conflicting, we apply the following logic:
+   * get both completed instants and ingestion inflight commits that have happened since the last successful write.
+   * We need to check for write conflicts since they may have mutated the same files
+   * that are being newly created by the current write.
+   */
+  private Stream<HoodieInstant> getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) {
+    // Fetch list of completed commits.
+    Stream<HoodieInstant> completedCommitsStream =
+        activeT
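The diff above is truncated in this digest, but the selection logic it describes can be illustrated with a small self-contained sketch. This is an assumption-laden stand-in, not Hudi's API: `Instant` here is a simplified record, and actions/states are plain strings. The key idea is that table-service writers (clustering, compaction) must also treat inflight ingestion commits as conflict candidates, so the ingestion writer wins.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified stand-in for a timeline instant; the real HoodieInstant carries more state.
record Instant(String action, String state, long transitionTime) {}

public class CandidateInstants {

  // Ingestion commits only conflict with instants that completed after the
  // current write started (compared by state transition time).
  static List<Instant> candidatesForIngestion(List<Instant> timeline, long since) {
    return timeline.stream()
        .filter(i -> i.state().equals("COMPLETED") && i.transitionTime() > since)
        .collect(Collectors.toList());
  }

  // Table-service commits additionally consider inflight ingestion commits
  // (commit/deltacommit), giving ingestion writers priority over table services.
  static List<Instant> candidatesForTableService(List<Instant> timeline, long since) {
    Stream<Instant> completed = timeline.stream()
        .filter(i -> i.state().equals("COMPLETED") && i.transitionTime() > since);
    Stream<Instant> inflightIngestion = timeline.stream()
        .filter(i -> i.state().equals("INFLIGHT")
            && (i.action().equals("commit") || i.action().equals("deltacommit")));
    return Stream.concat(completed, inflightIngestion).collect(Collectors.toList());
  }
}
```

Under this sketch, an inflight deltacommit shows up as a candidate only for a table-service writer, which is exactly the asymmetry that makes ingestion the primary writer.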
[GitHub] [hudi] hudi-bot commented on pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
hudi-bot commented on PR #9175: URL: https://github.com/apache/hudi/pull/9175#issuecomment-1631826023 ## CI report: * 98654ed70f92bcbe239b999e733b7aa79c296faf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18513) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9176: [HUDI-6523] fix get valid checkpoint for current writer
hudi-bot commented on PR #9176: URL: https://github.com/apache/hudi/pull/9176#issuecomment-1631826058 ## CI report: * 4e765c59cbf491b4a593e77b0951d9ab74b516bd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18514) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631825658 ## CI report: * 66acd28baa87819b949d165775556169243ccc3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8610: [HUDI-6156] Prevent leaving tmp file in timeline when multi process t…
hudi-bot commented on PR #8610: URL: https://github.com/apache/hudi/pull/8610#issuecomment-1631825131 ## CI report: * f34ffd6ccf4fd366ade5dad8487ff9a0a248bec8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16758) * 0bd8531ac13126771f84db19fa02fbe1828b762d UNKNOWN * bfc9450676a0742bad6e25ee1e56aa240e31fe74 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18512) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9176: [HUDI-6523] fix get valid checkpoint for current writer
hudi-bot commented on PR #9176: URL: https://github.com/apache/hudi/pull/9176#issuecomment-1631820866 ## CI report: * 4e765c59cbf491b4a593e77b0951d9ab74b516bd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
hudi-bot commented on PR #9175: URL: https://github.com/apache/hudi/pull/9175#issuecomment-1631820842 ## CI report: * 98654ed70f92bcbe239b999e733b7aa79c296faf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8610: [HUDI-6156] Prevent leaving tmp file in timeline when multi process t…
hudi-bot commented on PR #8610: URL: https://github.com/apache/hudi/pull/8610#issuecomment-1631819960 ## CI report: * f34ffd6ccf4fd366ade5dad8487ff9a0a248bec8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16758) * 0bd8531ac13126771f84db19fa02fbe1828b762d UNKNOWN * bfc9450676a0742bad6e25ee1e56aa240e31fe74 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631815699 ## CI report: * 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18502) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631815308 ## CI report: * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509) * 66acd28baa87819b949d165775556169243ccc3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18511) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] voonhous commented on pull request #9068: [HUDI-6447] Unify operation in compaction and clustering procedure
voonhous commented on PR #9068: URL: https://github.com/apache/hudi/pull/9068#issuecomment-1631813308 @Zouxxyy Do you mean: > For example, ~compaction~ clustering procedure only supports schedule and run (do both schedule and execute), execute only is not supported,... Got a little confused when I read the description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on a diff in pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
SteNicholas commented on code in PR #9175: URL: https://github.com/apache/hudi/pull/9175#discussion_r1260538162

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
## @@ -362,6 +362,10 @@ private void initClientIds(Configuration conf) {
   private void reset() {
     this.eventBuffer = new WriteMetadataEvent[this.parallelism];
+    if (!this.instant.equals(WriteMetadataEvent.BOOTSTRAP_INSTANT)) {
+      writeClient.getHeartbeatClient().stop(instant);

Review Comment: This should check whether the failed writes clean policy is lazy.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
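The reviewer's suggestion can be sketched as a guard around the heartbeat stop. This is a hypothetical illustration, not the Hudi code: `CleanPolicy`, `HeartbeatClient`, and `maybeStopHeartbeat` are simplified stand-ins, and the premise (heartbeats only matter under lazy failed-writes cleaning) is the reviewer's point, stated as an assumption here.

```java
// Assumed simplification of Hudi's failed-writes clean policy.
enum CleanPolicy { EAGER, LAZY }

// Minimal stand-in for the heartbeat client used by the write client.
interface HeartbeatClient {
  void stop(String instant);
}

public class ResetGuard {
  // Stop the instant's heartbeat only when the clean policy is LAZY and the
  // instant is a real one (not the placeholder bootstrap instant).
  static boolean maybeStopHeartbeat(CleanPolicy policy, String instant,
                                    String bootstrapInstant, HeartbeatClient client) {
    if (policy != CleanPolicy.LAZY || instant.equals(bootstrapInstant)) {
      return false; // nothing to stop: no heartbeat is maintained in these cases
    }
    client.stop(instant);
    return true;
  }
}
```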
[jira] [Updated] (HUDI-6523) Fix get valid checkpoint for current writer
[ https://issues.apache.org/jira/browse/HUDI-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6523: - Labels: pull-request-available (was: ) > Fix get valid checkpoint for current writer > --- > > Key: HUDI-6523 > URL: https://issues.apache.org/jira/browse/HUDI-6523 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Reporter: eric >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > 23/07/11 16:50:57 INFO HoodieCompactor: Compactor compacting > [CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0001-2bae-40d8-8038-401eefb9e7e3-0_20230711165027926.log.1_1-53-670], > > dataFileName=Option{val=0001-2bae-40d8-8038-401eefb9e7e3-0_1-27-432_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0001-2bae-40d8-8038-401eefb9e7e3-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3976.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.-eb61-4788-a9cb-aaa67e2e47c4-0_20230711165027926.log.1_0-53-671], > > dataFileName=Option{val=-eb61-4788-a9cb-aaa67e2e47c4-0_0-27-431_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='-eb61-4788-a9cb-aaa67e2e47c4-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3592.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0002-7160-4515-a0a6-7bcf2e03-0_20230711165027926.log.1_2-53-673], > > dataFileName=Option{val=0002-7160-4515-a0a6-7bcf2e03-0_2-27-433_20230711165027926.parquet}, > 
id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0002-7160-4515-a0a6-7bcf2e03-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3591.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}, > CompactionOperation{baseInstantTime='20230711165027926', > dataFileCommitTime=Option{val=20230711165027926}, > deltaFileNames=[.0003-5a31-411f-8430-ccf4bec128e8-0_20230711165027926.log.1_3-53-672], > > dataFileName=Option{val=0003-5a31-411f-8430-ccf4bec128e8-0_3-27-434_20230711165027926.parquet}, > id='HoodieFileGroupId{partitionPath='part=2023071116', > fileId='0003-5a31-411f-8430-ccf4bec128e8-0'}', > metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, > TOTAL_LOG_FILES_SIZE=3207.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, > bootstrapFilePath=Optional.empty}] files > 23/07/11 16:50:57 ERROR MicroBatchExecution: Query RateStreamSource [id = > 44581078-04ee-48ae-bc74-143b3c836a23, runId = > 916ce5a4-bd8a-4010-8c0e-869c58db41ab] terminated with error > org.apache.hudi.exception.HoodieIOException: Failed to parse > HoodieCommitMetadata for [==>20230711165055791__compaction__REQUESTED] > at > org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$1(CommitUtils.java:173) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:356) > at > java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:500) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > at > java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) > at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) > at > 
org.apache.hudi.common.util.CommitUtils.getValidCheckpointForCurrentWriter(CommitUtils.java:175) > at > org.apache.hudi.HoodieStreamingSink.canSkipBatch(HoodieStreamingSink.scala:313) > at > org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:104) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:665) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionI
[GitHub] [hudi] eric9204 opened a new pull request, #9176: [HUDI-6523] fix get valid checkpoint for current writer
eric9204 opened a new pull request, #9176: URL: https://github.com/apache/hudi/pull/9176 ### Change Logs Filter for the commit and deltacommit actions when getting the valid checkpoint for the current writer. ### Impact NONE ### Risk level (write none, low medium or high below) NONE ### Documentation Update NONE ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
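The shape of this fix can be sketched with plain collections: a requested compaction instant carries a compaction plan rather than commit metadata, so trying to parse it for a checkpoint throws, and the lookup must skip non-commit actions first. This is a hedged illustration with invented types (`TimelineInstant`, `latestCheckpoint`) and a made-up metadata key, not Hudi's `CommitUtils` API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for a timeline instant and its parsed commit metadata.
record TimelineInstant(String action, Map<String, String> extraMetadata) {}

public class CheckpointLookup {
  // Scan instants (assumed to be in reverse chronological order) for the most
  // recent checkpoint, considering only actions whose metadata is commit
  // metadata; compaction/replacecommit instants are filtered out up front.
  static Optional<String> latestCheckpoint(List<TimelineInstant> reverseOrdered, String ckpKey) {
    return reverseOrdered.stream()
        .filter(i -> i.action().equals("commit") || i.action().equals("deltacommit"))
        .map(i -> i.extraMetadata().get(ckpKey))
        .filter(v -> v != null)
        .findFirst();
  }
}
```

Without the action filter, the stream would attempt to read checkpoint metadata from the requested compaction instant first and fail, which mirrors the `HoodieIOException` in the linked Jira.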
[GitHub] [hudi] stream2000 commented on a diff in pull request #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
stream2000 commented on code in PR #9175: URL: https://github.com/apache/hudi/pull/9175#discussion_r1260529173

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java:
## @@ -362,6 +362,10 @@ private void initClientIds(Configuration conf) {
   private void reset() {
     this.eventBuffer = new WriteMetadataEvent[this.parallelism];
+    if (!this.instant.equals(WriteMetadataEvent.BOOTSTRAP_INSTANT)) {
+      writeClient.getHeartbeatClient().stop(instant);
+      this.ckpMetadata.abortInstant(instant);

Review Comment: Wondering if we need to abort the ckpMetadata...

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] stream2000 opened a new pull request, #9175: [MINOR] StreamWriteOperatorCoordinator stop current instant heartbeat…
stream2000 opened a new pull request, #9175: URL: https://github.com/apache/hudi/pull/9175 … before start new instant ### Change Logs When some sub-tasks fail and a global failure is triggered, write tasks are reset and re-deployed. Because `StreamWriteCoordinator` does not extend `RecreateOnResetOperatorCoordinator`, the coordinator won't be closed, so the heartbeat thread of the last instant will remain in the JVM. The whole process is: sub-tasks fail -> global failure triggered -> coordinator resets write tasks -> write tasks recreate and bootstrap -> coordinator handles the bootstrap event -> starts a new instant if no recommit is needed (thus the last instant's heartbeat will remain). ### Impact Stop the current instant's heartbeat before starting a new instant. ### Risk level (write none, low medium or high below) low ### Documentation Update NONE ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6524) Support re-attempting failed compaction
sivabalan narayanan created HUDI-6524: - Summary: Support re-attempting failed compaction Key: HUDI-6524 URL: https://issues.apache.org/jira/browse/HUDI-6524 Project: Apache Hudi Issue Type: Improvement Components: deltastreamer Reporter: sivabalan narayanan When compaction fails and if user restarts ingestion, we need support to re-attempt compaction w/o doing delta commits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6523) Fix get valid checkpoint for current writer
eric created HUDI-6523: -- Summary: Fix get valid checkpoint for current writer Key: HUDI-6523 URL: https://issues.apache.org/jira/browse/HUDI-6523 Project: Apache Hudi Issue Type: Bug Components: spark Reporter: eric Fix For: 0.14.0 23/07/11 16:50:57 INFO HoodieCompactor: Compactor compacting [CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0001-2bae-40d8-8038-401eefb9e7e3-0_20230711165027926.log.1_1-53-670], dataFileName=Option{val=0001-2bae-40d8-8038-401eefb9e7e3-0_1-27-432_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0001-2bae-40d8-8038-401eefb9e7e3-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3976.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.-eb61-4788-a9cb-aaa67e2e47c4-0_20230711165027926.log.1_0-53-671], dataFileName=Option{val=-eb61-4788-a9cb-aaa67e2e47c4-0_0-27-431_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='-eb61-4788-a9cb-aaa67e2e47c4-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3592.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0002-7160-4515-a0a6-7bcf2e03-0_20230711165027926.log.1_2-53-673], dataFileName=Option{val=0002-7160-4515-a0a6-7bcf2e03-0_2-27-433_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0002-7160-4515-a0a6-7bcf2e03-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3591.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}, CompactionOperation{baseInstantTime='20230711165027926', 
dataFileCommitTime=Option{val=20230711165027926}, deltaFileNames=[.0003-5a31-411f-8430-ccf4bec128e8-0_20230711165027926.log.1_3-53-672], dataFileName=Option{val=0003-5a31-411f-8430-ccf4bec128e8-0_3-27-434_20230711165027926.parquet}, id='HoodieFileGroupId{partitionPath='part=2023071116', fileId='0003-5a31-411f-8430-ccf4bec128e8-0'}', metrics={TOTAL_LOG_FILES=1.0, TOTAL_IO_READ_MB=0.0, TOTAL_LOG_FILES_SIZE=3207.0, TOTAL_IO_WRITE_MB=0.0, TOTAL_IO_MB=0.0}, bootstrapFilePath=Optional.empty}] files 23/07/11 16:50:57 ERROR MicroBatchExecution: Query RateStreamSource [id = 44581078-04ee-48ae-bc74-143b3c836a23, runId = 916ce5a4-bd8a-4010-8c0e-869c58db41ab] terminated with error org.apache.hudi.exception.HoodieIOException: Failed to parse HoodieCommitMetadata for [==>20230711165055791__compaction__REQUESTED] at org.apache.hudi.common.util.CommitUtils.lambda$getValidCheckpointForCurrentWriter$1(CommitUtils.java:173) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:356) at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:500) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) at org.apache.hudi.common.util.CommitUtils.getValidCheckpointForCurrentWriter(CommitUtils.java:175) at org.apache.hudi.HoodieStreamingSink.canSkipBatch(HoodieStreamingSink.scala:313) at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:104) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:665) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:663) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:375) at org
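The stack trace above shows `getValidCheckpointForCurrentWriter` throwing while trying to parse commit metadata for a compaction instant that is still in REQUESTED state (`[==>20230711165055791__compaction__REQUESTED]`), which has no metadata to parse yet. A minimal stdlib-only sketch of the fix idea — skip non-completed instants before attempting to parse — using a hypothetical `"<timestamp>__<action>__<state>"` string encoding (the real Hudi timeline API differs):

```java
import java.util.List;
import java.util.Optional;

public class CheckpointScanSketch {
    // Hypothetical instant encoding: "<timestamp>__<action>__<state>".
    static Optional<String> latestParseableInstant(List<String> timeline) {
        return timeline.stream()
                // Requested/inflight instants (e.g. a pending compaction) have no
                // commit metadata yet, so parsing them would throw; skip them first.
                .filter(i -> i.endsWith("__COMPLETED"))
                .reduce((a, b) -> b); // keep the newest completed instant
    }

    public static void main(String[] args) {
        List<String> timeline = List.of(
                "20230711165027926__deltacommit__COMPLETED",
                "20230711165055791__compaction__REQUESTED");
        System.out.println(latestParseableInstant(timeline).orElse("none"));
    }
}
```

This is only an illustration of the guard; the actual fix belongs in `CommitUtils.getValidCheckpointForCurrentWriter`, which streams real `HoodieInstant` objects.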
[GitHub] [hudi] nylqd commented on issue #9162: [SUPPORT] Permission denied access ckp_meta while start flink job
nylqd commented on issue #9162: URL: https://github.com/apache/hudi/issues/9162#issuecomment-1631793719 > @nylqd Can you please share the permissions for all directories under: > > ``` > /path_to_hudi_table/.hoodie/.aux/ckp_meta/. > ``` > > When restarting a job, under normal conditions, the `ckp_meta` is deleted and recreated again. > > The code (from the master branch; iiuc, the code for this section hasn't changed much since 0.11) for it is shown below: > > https://github.com/apache/hudi/assets/6312314/bbf1884c-7daa-45e7-9624-e292c529c9e9 > > This is how mkdirs is implemented: > > https://github.com/apache/hudi/assets/6312314/20772086-7a6c-40b5-a812-bd1d7d4fa1be > > Of which the default permission of 777 is used: > https://github.com/apache/hudi/assets/6312314/c4204d06-1ca6-4c86-b9b8-7419f87d8968 This is the aux dir while the job is running: ![image](https://github.com/apache/hudi/assets/3632490/63108195-e046-4a69-828a-6588aa862095) And after stopping the job: ![image](https://github.com/apache/hudi/assets/3632490/2ec6321d-5cf3-4141-8c48-ddf34995e79a) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
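The exchange above describes `ckp_meta` being deleted and recreated on restart with default 777 permissions. A small stand-alone sketch of that behavior on the local filesystem (the path is a hypothetical stand-in for `<table_path>/.hoodie/.aux/ckp_meta`, and this does not use Hudi's HDFS-based `mkdirs`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class CkpMetaPermsSketch {
    // Create a local stand-in for the ckp_meta dir with the wide-open 777
    // permissions the thread says mkdirs applies by default.
    static Path createWideOpenDir() throws IOException {
        Path ckpMeta = Files.createTempDirectory("ckp_meta_demo");
        Files.setPosixFilePermissions(ckpMeta, PosixFilePermissions.fromString("rwxrwxrwx"));
        return ckpMeta;
    }

    public static void main(String[] args) throws IOException {
        Path dir = createWideOpenDir();
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(dir)));
    }
}
```

If the directory shows anything more restrictive than `rwxrwxrwx` (as in the screenshots), the restarting job's user may be unable to recreate or write under it, which would explain the permission-denied error.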
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8832: [HUDI-6288] Create IngestionPrimaryWriterBasedConflictResolutionStrategy to prioritize ingestion writers over other writers
Zouxxyy commented on code in PR #8832: URL: https://github.com/apache/hudi/pull/8832#discussion_r1260517417 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/IngestionPrimaryWriterBasedConflictResolutionStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.client.transaction; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.util.ClusteringUtils; +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.common.util.Option; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Comparator; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.COMPACTION_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.DELTA_COMMIT_ACTION; +import static org.apache.hudi.common.table.timeline.HoodieTimeline.REPLACE_COMMIT_ACTION; + +/** + * This class extends the base implementation of conflict resolution strategy. + * It gives preference to ingestion writers compared to table services. + */ +public class IngestionPrimaryWriterBasedConflictResolutionStrategy +extends SimpleConcurrentFileWritesConflictResolutionStrategy { + + private static final Logger LOG = LoggerFactory.getLogger(IngestionPrimaryWriterBasedConflictResolutionStrategy.class); + + /** + * For tableservices like replacecommit and compaction commits this method also returns ingestion inflight commits. 
+ */ + @Override + public Stream<HoodieInstant> getCandidateInstants(HoodieTableMetaClient metaClient, HoodieInstant currentInstant, +Option<HoodieInstant> lastSuccessfulInstant) { +HoodieActiveTimeline activeTimeline = metaClient.reloadActiveTimeline(); +if ((REPLACE_COMMIT_ACTION.equals(currentInstant.getAction()) + && ClusteringUtils.isClusteringCommit(metaClient, currentInstant)) +|| COMPACTION_ACTION.equals(currentInstant.getAction())) { + return getCandidateInstantsForTableServicesCommits(activeTimeline, currentInstant); +} else { + return getCandidateInstantsForNonTableServicesCommits(activeTimeline, currentInstant); +} + } + + private Stream<HoodieInstant> getCandidateInstantsForNonTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { + +// To find out which instants are conflicting, we apply the following logic: +// Get all the completed instants timeline only for commits that have happened +// since the last successful write based on the transition times. +// We need to check for write conflicts since they may have mutated the same files +// that are being newly created by the current write. +List<HoodieInstant> completedCommitsInstants = activeTimeline +.getTimelineOfActions(CollectionUtils.createSet(COMMIT_ACTION, REPLACE_COMMIT_ACTION, COMPACTION_ACTION, DELTA_COMMIT_ACTION)) +.filterCompletedInstants() + .findInstantsModifiedAfterByStateTransitionTime(currentInstant.getTimestamp()) +.getInstantsOrderedByStateTransitionTime() +.collect(Collectors.toList()); +LOG.info(String.format("Instants that may have conflict with %s are %s", currentInstant, completedCommitsInstants)); +return completedCommitsInstants.stream(); + } + + /** + * To find which instants are conflicting, we apply the following logic: + * Get both completed instants and ingestion inflight commits that have happened since the last successful write. + * We need to check for write conflicts since they may have mutated the same files + * that are being newly created by the current write. 
+ */ + private Stream<HoodieInstant> getCandidateInstantsForTableServicesCommits(HoodieActiveTimeline activeTimeline, HoodieInstant currentInstant) { +// Fetch list of completed commits. +Stream<HoodieInstant> completedCommitsStream = +activeT
[GitHub] [hudi] hudi-bot commented on pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
hudi-bot commented on PR #9158: URL: https://github.com/apache/hudi/pull/9158#issuecomment-1631790941 ## CI report: * cfc0f3271163b1d5cc16b4766210b67c61ecacf6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18488) * 5a26a9ab4d8ccd2f9d637c3a79517d44dc75cbbc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18510) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631790650 ## CI report: * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509) * 66acd28baa87819b949d165775556169243ccc3f UNKNOWN
[GitHub] [hudi] ChiaWeiGithub opened a new issue, #9174: [SUPPORT]flink sql hudi will create duplicate record on same s3 path when the data size is bigger enough to create new filegroup and different
ChiaWeiGithub opened a new issue, #9174: URL: https://github.com/apache/hudi/issues/9174 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** Flink SQL with Hudi will create duplicate records on the same S3 path when the data size is big enough to create a new filegroup and a different Flink session runs under a different YARN application. **To Reproduce** Steps to reproduce the behavior: 1. Download e-commerce data from https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store 2. Upload it to an S3 bucket 3. Run the query below from Athena, modifying the S3 path: ` CREATE EXTERNAL TABLE `ecommerce`( `event_time` string, `event_type` string, `product_id` bigint, `category_id` bigint, `category_code` string, `brand` string, `price` double, `user_id` bigint, `user_session` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3:///opendata/ecommerce/' CREATE TABLE ecommercedistinct WITH ( format = 'Parquet', write_compression = 'SNAPPY' ) AS SELECT sum(price) as price, user_session, CAST(NULL AS INTEGER) AS empty_column FROM "your_database"."ecommerce" where user_session IS NOT NULL group by user_session; ` 4. Since the data still has null values in user_session, which will break the query later on, find the null user sessions: select "$path", user_session from "ecommercedistinct" order by user_session ASC; 5. Randomly pick 5 files under the S3 path of table ecommercedistinct, avoiding the file with the null user_session 6. 
Launch an EMR cluster, primary: emr-6.7.0 with Flink, m5.2xlarge; 5 core nodes with m5.2xlarge. 7. Create a Flink session: `export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` sudo /usr/lib/flink/bin/start-cluster.sh flink-yarn-session -d -jm 12288 -tm 12288 /usr/lib/flink/bin/sql-client.sh embedded -e /home/hadoop/sql-env.yaml flink sql> add jar '/usr/lib/hudi/hudi-flink1.14-bundle_2.12-0.11.0-amzn-0.jar'; # create source table. Please modify the s3 path to your s3 path with the data in partaccesslog.tar.gz CREATE TABLE tablesource1 ( `price` double, `user_session` varchar(255) PRIMARY KEY, `empty_column` int ) with ('connector'='filesystem', 'path'= 's3:///tables/5file/','format'='parquet'); # create target table, set 'hoodie.copyonwrite.insert.split.size'='500' to force creation of a bigger filegroup file; please modify the s3 path accordingly. CREATE TABLE tabletarget1 ( `price` double, `user_session` varchar(255) PRIMARY KEY, `empty_column` int ) with ('connector'='hudi', 'path'= 's3:///ecommercetarget1/', 'table.type' = 'COPY_ON_WRITE', 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload', 'write.payload.class'= 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload', 'write.batch-size'='32', 'hoodie.copyonwrite.insert.split.size'='500', 'table.exec.sink.not-null-enforcer'='drop'); INSERT INTO tabletarget1 select * from tablesource1; <== takes about 6+ minutes to finish and create a file ` 8. To make a duplicate record: stop the SQL session, kill the YARN application, create a new one, start a new Flink SQL session, randomly choose a user_session id and insert: ` INSERT INTO tabletarget1 (user_session,empty_column) VALUES ('0bc5bd30-4b48-4637-b5f0-9e327707d2bc',1); ` 9. 
Go to Amazon Athena to check; we can see two rows with user_session='94CFQJ4T7409AW2P': ` CREATE EXTERNAL TABLE `tabletarget`( `price` double, `user_session` string, `empty_column` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3:///ecommercetarget1/'; select "$path","user_session" from tabletarget where user_session='94CFQJ4T7409AW2P'; ` **Expected behavior** The record is not duplicated when the data size is small. For example, if I create an empty table and insert values, it is always deduplicated. No matter I use a new fli
[GitHub] [hudi] hudi-bot commented on pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
hudi-bot commented on PR #9158: URL: https://github.com/apache/hudi/pull/9158#issuecomment-1631785976 ## CI report: * cfc0f3271163b1d5cc16b4766210b67c61ecacf6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18488) * 5a26a9ab4d8ccd2f9d637c3a79517d44dc75cbbc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631780663 ## CI report: * 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631780464 ## CI report: * 5f1f42b93a005a10f308559091bd1f329a7318ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18508) * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18509)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
danny0405 commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260505794 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: > Another way, I can new a ArrayList to addAll items, just like: This is always better.
[GitHub] [hudi] danny0405 commented on pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
danny0405 commented on PR #9160: URL: https://github.com/apache/hudi/pull/9160#issuecomment-1631772609 cc @bvaradar, can you take a look at this change? I'm not very clear about its impact on the Spark engine/workflow.
[GitHub] [hudi] KnightChess commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
KnightChess commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260502975 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: This fixes the error below: the immutable empty list cannot accept new items. ```shell java.lang.UnsupportedOperationException at java.util.AbstractList.add(AbstractList.java:148) at java.util.AbstractList.add(AbstractList.java:108) ``` Alternatively, I can create a new ArrayList and addAll the items into it, like this: ```java if (config.areReleaseResourceEnabled()) { HoodieSparkEngineContext sparkEngineContext = (HoodieSparkEngineContext) context; Map<Integer, JavaRDD<?>> allCachedRdds = sparkEngineContext.getJavaSparkContext().getPersistentRDDs(); List<Integer> allDataIds = new ArrayList<>(sparkEngineContext.removeCachedDataIds(HoodieDataCacheKey.of(basePath, instantTime))); if (config.isMetadataTableEnabled()) { String metadataTableBasePath = HoodieTableMetadata.getMetadataTableBasePath(basePath); allDataIds.addAll(sparkEngineContext.removeCachedDataIds(HoodieDataCacheKey.of(metadataTableBasePath, instantTime))); } for (int id : allDataIds) { if (allCachedRdds.containsKey(id)) { allCachedRdds.get(id).unpersist(); } } } ```
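The review exchange above turns on why `Collections.emptyList()` is unsafe as a `getOrDefault` fallback when callers mutate the result: it returns the shared immutable empty-list singleton, and any `add` throws the `UnsupportedOperationException` quoted in the comment. A minimal stdlib-only repro:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmptyListPitfall {
    public static void main(String[] args) {
        Map<String, List<Integer>> cache = new HashMap<>();

        // Fallback is the shared immutable empty list: mutating it throws.
        List<Integer> bad = cache.getOrDefault("missing", Collections.emptyList());
        try {
            bad.add(1);
        } catch (UnsupportedOperationException e) {
            System.out.println("immutable fallback: " + e.getClass().getSimpleName());
        }

        // A fresh ArrayList (or copying entries into one, as the comment
        // suggests) is mutable and safe to return to callers.
        List<Integer> good = new ArrayList<>(cache.getOrDefault("missing", Collections.emptyList()));
        good.add(1);
        System.out.println(good);
    }
}
```

Copying into a new `ArrayList` (the "addAll" variant) also avoids handing callers a reference to internal cached state, which is why the reviewer calls it "always better".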
[GitHub] [hudi] danny0405 commented on a diff in pull request #9165: [HUDI-6517] Throw error if deletion of invalid data file fails
danny0405 commented on code in PR #9165: URL: https://github.com/apache/hudi/pull/9165#discussion_r1260496231 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java: ## @@ -688,7 +688,10 @@ private void deleteInvalidFilesByPartitions(HoodieEngineContext context, Map
[GitHub] [hudi] danny0405 commented on issue #9132: [SUPPORT] hudi deltastreamer jsonkafka source schema registry fail
danny0405 commented on issue #9132: URL: https://github.com/apache/hudi/issues/9132#issuecomment-1631762925 > Exception in thread "main" org.apache.avro.SchemaParseException: Type not supported: object at org.apache.avro.Schema.parse(Schema.java:1734) at org.apache.avro.Schema$Parser.parse(Schema.java:1430) It is fairly apparent that the schema fetched from the provider is malformed: Avro cannot parse it correctly. I'm not sure whether this is related to the Avro version. What version of Avro did you use when generating the schema string?
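The error `Type not supported: object` usually indicates that a JSON-Schema-style document (`"type": "object"`) was handed to Avro's `Schema.Parser`, which only understands Avro types such as `record`. For comparison, a shape Avro's parser does accept (record and field names here are hypothetical, not from this issue):

```json
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": ["null", "long"], "default": null}
  ]
}
```

If the schema registry serves JSON Schema rather than Avro schemas, the Deltastreamer schema provider would fail exactly this way regardless of the Avro version.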
[GitHub] [hudi] danny0405 commented on issue #9172: [SUPPORT] java.lang.ClassCastException with incremental query
danny0405 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1631754480 It seems vectorized execution is enabled while Hudi cannot handle it here; this is related to the Spark version. Can you use Spark 3.2.x and try again? This will be fixed in 0.14.0.
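Until upgrading, a commonly used mitigation for vectorized-reader incompatibilities with Hudi incremental queries (an assumption here, not confirmed in this thread) is to disable Spark's vectorized Parquet reader so the row-based read path is used:

```properties
# Disable Spark's vectorized Parquet reader for the session running the
# incremental query; this trades some scan performance for compatibility.
spark.sql.parquet.enableVectorizedReader=false
```

This can be set in `spark-defaults.conf`, via `--conf` on `spark-submit`, or with `spark.conf.set(...)` before running the query.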
[GitHub] [hudi] danny0405 commented on a diff in pull request #9158: [MINOR] Unpersist only relevent metadata table RDDs
danny0405 commented on code in PR #9158: URL: https://github.com/apache/hudi/pull/9158#discussion_r1260483097 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java: ## @@ -207,15 +206,15 @@ public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) { @Override public List<Integer> getCachedDataIds(HoodieDataCacheKey cacheKey) { synchronized (cachedRddIds) { - return cachedRddIds.getOrDefault(cacheKey, Collections.emptyList()); + return cachedRddIds.getOrDefault(cacheKey, new ArrayList<>()); Review Comment: why this change?
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631750583 ## CI report: * 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507) * 5f1f42b93a005a10f308559091bd1f329a7318ff Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18508) * 28e822b1a4b83751727dbca0d3d632ca9ca2ac3e UNKNOWN
[GitHub] [hudi] danny0405 commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
danny0405 commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631750486 > then a wrong setting here would probably keep only a single HFile Can we add some validation logic in the metadata table write config builder to guard correctness? Note that keeping at least 2 versions for each file group will also double the storage used by the metadata table.
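The validation danny0405 suggests for the metadata table write config builder could look roughly like the sketch below (the method, class, and error message are hypothetical; `hoodie.cleaner.fileversions.retained` is the cleaner's retained-versions config key):

```java
public class MdtCleanerConfigGuard {
    // Hypothetical guard: the metadata table must keep at least 2 file versions
    // per file group so a concurrent reader never sees its only HFile deleted
    // while a clean is in progress.
    static int validateRetainedFileVersions(int retained) {
        if (retained < 2) {
            throw new IllegalArgumentException(
                "Metadata table requires hoodie.cleaner.fileversions.retained >= 2, got " + retained);
        }
        return retained;
    }

    public static void main(String[] args) {
        System.out.println(validateRetainedFileVersions(2));
    }
}
```

Failing fast at config-build time, rather than discovering a missing HFile at read time, is the point of the suggestion; the storage-doubling trade-off noted above still applies.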
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631745561 ## CI report: * b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504) * 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507) * 5f1f42b93a005a10f308559091bd1f329a7318ff UNKNOWN
[GitHub] [hudi] stream2000 commented on pull request #9169: [HUDI-6521] Disable failing test case.
stream2000 commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631745271 Hi @amrishlal and @nsivabalan, sorry for the test case failure I introduced. I have raised #9163, which fixes CI. Should we just revert this PR and enable the test again? Thanks~
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631739180 ## CI report: * 55648de1a087b2081dde0ef11ef1eb6944e18e0c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18501)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631712005

## CI report:

* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
* 4360020cdb803dff6dc4e21bab9745ec287956f2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18507)
[GitHub] [hudi] soumilshah1995 commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource
soumilshah1995 commented on issue #9143: URL: https://github.com/apache/hudi/issues/9143#issuecomment-1631710874

# Project: Using Apache Hudi DeltaStreamer and AWS DMS Hands-on Labs

![image](https://user-images.githubusercontent.com/39345855/228927370-f7264d4a-f026-4014-9df4-b063f000f377.png)

Video Tutorials

* Part 1: Project Overview: https://www.youtube.com/watch?v=D9L0NLSqC1s
* Part 2: Aurora Setup: https://youtu.be/HR5A6iGb4LE
* Part 3: https://youtu.be/rnyj5gkIPKA
* Part 4: https://youtu.be/J1xvPIcDIaQ
* Part 5: https://youtu.be/On-6L6oJ158
* How to set up an EMR cluster with VPC: https://www.youtube.com/watch?v=-e1-Zsk17Ss&t=4s
* PDF guide: https://drive.google.com/file/d/1Hj_gyZ8o-wFf4tqTYZXOHNJz2f3ARepn/view

# Steps

### Step 1: Create the Aurora source database and update the settings to enable CDC on Postgres

* Read more: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
* Create a new parameter group and make sure to edit these two settings as shown in video part 2:

```
rds.logical_replication 1
wal_sender_timeout 30
```

Once done, apply these to the database and reboot it.

### Step 2: Run the Python file to create a table called sales in the public schema in Aurora and populate some data into the table

Run `python run.py`

```python
try:
    import os
    import logging
    from functools import wraps
    from abc import ABC, abstractmethod
    from enum import Enum
    from logging import StreamHandler
    import uuid
    from datetime import datetime, timezone
    from random import randint
    import datetime
    import sqlalchemy as db
    from faker import Faker
    import random
    import psycopg2
    import psycopg2.extras as extras
    from dotenv import load_dotenv

    load_dotenv(".env")
except Exception as e:
    raise Exception("Error: {} ".format(e))


class Logging:
    """This class is used for logging data to Datadog and to the console."""

    def __init__(self, service_name, ddsource, logger_name="demoapp"):
        self.service_name = service_name
        self.ddsource = ddsource
        self.logger_name = logger_name
        format = "[%(asctime)s] %(name)s %(levelname)s %(message)s"
        self.logger = logging.getLogger(self.logger_name)
        formatter = logging.Formatter(format, )
        if logging.getLogger().hasHandlers():
            logging.getLogger().setLevel(logging.INFO)
        else:
            logging.basicConfig(level=logging.INFO)


global logger
logger = Logging(service_name="database-common-module",
                 ddsource="database-common-module",
                 logger_name="database-common-module")


def error_handling_with_logging(argument=None):
    def real_decorator(function):
        @wraps(function)
        def wrapper(self, *args, **kwargs):
            function_name = function.__name__
            response = None
            try:
                if kwargs == {}:
                    response = function(self)
                else:
                    response = function(self, **kwargs)
            except Exception as e:
                response = {
                    "status": -1,
                    "error": {"message": str(e), "function_name": function.__name__},
                }
                logger.logger.info(response)
            return response

        return wrapper

    return real_decorator


class DatabaseInterface(ABC):
    @abstractmethod
    def get_data(self, query):
        """
        For a given query fetch the data
        :param query: Str
        :return: Dict
        """

    def execute_non_query(self, query):
        """
        Inserts data into SQL Server
        :param query: Str
        :return: Dict
        """

    def insert_many(self, query, data):
        """
        Insert many items into the database
        :param query: str
        :param data: tuple
        :return: Dict
        """

    def get_data_batch(self, batch_size=10, query=""):
        """
        Gets data in batches
        :param batch_size: INT
        :param query: STR
        :return: DICT
        """

    def get_table(self, table_name=""):
        """
        Gets the table from the database
        :param table_name: STR
        :return: OBJECT
        """


class Settings(object):
    """settings class"""

    def __init__(
            self,
            port="",
```
[GitHub] [hudi] soumilshah1995 commented on issue #9170: [SUPPORT] Possibly unquoted identifier in Glue V4 Streaming Job
soumilshah1995 commented on issue #9170: URL: https://github.com/apache/hudi/issues/9170#issuecomment-1631709692

Can you try this code?

```python
try:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.sql.session import SparkSession
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.sql import DataFrame, Row
    from pyspark.sql.functions import *
    from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
    import datetime
    from awsglue import DynamicFrame
    import pyspark.sql.functions as F
    import boto3
except Exception as e:
    print("ERROR IMPORTS ", e)

"""
{
  "version": "0",
  "id": "6b437ef5-9686-f7c7-b7dd-a2438bf5c92c",
  "detail-type": "order",
  "source": "order",
  "account": "043916019468",
  "time": "2023-02-12T10:42:48Z",
  "region": "us-west-1",
  "resources": [],
  "detail": {
    "orderid": "a59a6e61-ebb3-4442--53d9654ab3eb",
    "customer_id": "71d03aa7-5179-4c62-ab7d-e9cbf981d889",
    "ts": "2023-02-12T10:42:48.264802",
    "order_value": "691",
    "priority": "MEDIUM"
  }
}

--- INPUT ---
root
 |-- account: string (nullable = true)
 |-- detail: struct (nullable = true)
 |    |-- customer_id: string (nullable = true)
 |    |-- order_value: string (nullable = true)
 |    |-- orderid: string (nullable = true)
 |    |-- priority: string (nullable = true)
 |    |-- ts: string (nullable = true)
 |-- detail-type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- resources: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- source: string (nullable = true)
 |-- time: string (nullable = true)
 |-- version: string (nullable = true)

--- OUTPUT ---
root
 |-- account: string (nullable = true)
 |-- detail_type: string (nullable = true)
 |-- event_id: string (nullable = true)
 |-- region: string (nullable = true)
 |-- source: string (nullable = true)
 |-- time: string (nullable = true)
 |-- version: string (nullable = true)
 |-- detail_customer_id: string (nullable = true)
 |-- detail_order_value: string (nullable = true)
 |-- detail_orderid: string (nullable = true)
 |-- detail_priority: string (nullable = true)
 |-- detail_ts: string (nullable = true)
"""


def create_spark_session():
    spark = SparkSession \
        .builder \
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
        .getOrCreate()
    return spark


spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)

# == Settings
db_name = "hudidb"
kinesis_table_name = 'kinesis_order_events'
table_name = "order"
record_key = 'detail_orderid'
precomb = 'detail_ts'
s3_bucket = 'sX'
s3_path_hudi = f's3a://{s3_bucket}/{table_name}/'
s3_path_spark = f's3://{s3_bucket}/spark_checkpoints/'
method = 'upsert'
table_type = "MERGE_ON_READ"
window_size = '10 seconds'
starting_position_of_kinesis_iterator = 'trim_horizon'

connection_options = {
    'hoodie.table.name': table_name,
    "hoodie.datasource.write.storage.type": table_type,
    'hoodie.datasource.write.recordkey.field': record_key,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
    'hoodie.datasource.write.precombine.field': precomb,
    'hoodie.datasource.hive_sync.enable': 'true',
    "hoodie.datasource.hive_sync.mode": "hms",
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': db_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
}

# ==
starting_position_of_kinesis_iterator = starting_position_of_kinesis_iterator

data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=db_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true",
                        "startingPosition": starting_position_of_kinesis_iterator}
)


def flatten_df(nested_df):
    flat_cols = [c[0] for
```
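The message above is cut off inside `flatten_df`. The usual shape of that helper — hoisting nested struct fields to top-level `parent_child` column names, which is exactly what the OUTPUT schema in the docstring shows — can be illustrated on plain dictionaries (a sketch of the idea only, not the original PySpark code):

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Hoist nested dicts to top-level keys joined with '_',
    mirroring what flatten_df does for Spark struct columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}".replace("-", "_")  # e.g. 'detail-type' -> 'detail_type'
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

event = {"detail-type": "order", "detail": {"orderid": "a59", "ts": "2023-02-12"}}
print(flatten(event))
# {'detail_type': 'order', 'detail_orderid': 'a59', 'detail_ts': '2023-02-12'}
```

In the PySpark version the same renaming is typically done with `col(...).alias(...)` selects built from `nested_df.dtypes`.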
[GitHub] [hudi] soumilshah1995 commented on issue #9172: [SUPPORT] java.lang.ClassCastException with incremental query
soumilshah1995 commented on issue #9172: URL: https://github.com/apache/hudi/issues/9172#issuecomment-1631708363

can you share some code or other details?
[GitHub] [hudi] leesf commented on a diff in pull request #9160: [HUDI-6501] HoodieHeartbeatClient should stop all heartbeats and not delete heartbeat files for close
leesf commented on code in PR #9160: URL: https://github.com/apache/hudi/pull/9160#discussion_r1260443951

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/heartbeat/HoodieHeartbeatClient.java:

```diff
@@ -185,36 +179,55 @@ public void start(String instantTime) {
   }

   /**
-   * Stops the heartbeat for the specified instant.
-   * @param instantTime
+   * Stops the heartbeat and deletes the heartbeat file for the specified instant.
+   *
+   * @param instantTime The instant time for the heartbeat.
    * @throws HoodieException
    */
   public void stop(String instantTime) throws HoodieException {
     Heartbeat heartbeat = instantToHeartbeatMap.get(instantTime);
-    if (heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped()) {
-      LOG.info("Stopping heartbeat for instant " + instantTime);
-      heartbeat.getTimer().cancel();
-      heartbeat.setHeartbeatStopped(true);
-      LOG.info("Stopped heartbeat for instant " + instantTime);
+    if (isHeartbeatStarted(heartbeat)) {
+      stopHeartbeatTimer(heartbeat);
       HeartbeatUtils.deleteHeartbeatFile(fs, basePath, instantTime);
       LOG.info("Deleted heartbeat file for instant " + instantTime);
     }
   }

   /**
-   * Stops all heartbeats started via this instance of the client.
+   * Stops all timers of heartbeats started via this instance of the client.
+   *
    * @throws HoodieException
    */
-  public void stop() throws HoodieException {
-    instantToHeartbeatMap.values().forEach(heartbeat -> stop(heartbeat.getInstantTime()));
+  public void stopHeartbeatTimers() throws HoodieException {
+    instantToHeartbeatMap.values().stream().filter(this::isHeartbeatStarted).forEach(this::stopHeartbeatTimer);
+  }
+
+  /**
+   * Whether the given heartbeat is started.
+   *
+   * @param heartbeat The heartbeat to check whether is started.
+   * @return Whether the heartbeat is started.
+   * @throws IOException
+   */
+  private boolean isHeartbeatStarted(Heartbeat heartbeat) {
+    return heartbeat != null && heartbeat.isHeartbeatStarted() && !heartbeat.isHeartbeatStopped();
```

Review Comment: would `heartbeat != null && heartbeat.isHeartbeatStarted()` or `heartbeat != null && heartbeat.isHeartbeatStopped()` be enough?
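The reviewer's question can be settled by enumerating the boolean states (a toy model, not Hudi's classes; the `heartbeat != null` guard is common to every variant, and the second suggestion is read here as `!isHeartbeatStopped()`):

```python
from itertools import product

def full(started, stopped):              # the PR's check
    return started and not stopped

def only_started(started, stopped):      # first suggested simplification
    return started

def only_not_stopped(started, stopped):  # second suggested simplification
    return not stopped

# Print the truth table for all four (started, stopped) combinations.
for started, stopped in product([True, False], repeat=2):
    print(started, stopped, full(started, stopped),
          only_started(started, stopped), only_not_stopped(started, stopped))
```

The shorter checks disagree with the full one on (started=True, stopped=True) and (started=False, stopped=False) respectively, so dropping a clause is only safe if those states are unreachable in practice.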
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631706730

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
* 4360020cdb803dff6dc4e21bab9745ec287956f2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631673249

## CI report:

* 10f9adc8c62d6fae7219a4472cba010a1c1c0da0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18472)
* 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18506)
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631673194

## CI report:

* e0622a6d0f2f7294ed079dd42cf5ff65a8718da3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18478)
* c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18505)
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631672847

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18504)
[GitHub] [hudi] hudi-bot commented on pull request #9145: [HUDI-6464] Codreview changes for Spark SQL Merge Into for pkless tables'
hudi-bot commented on PR #9145: URL: https://github.com/apache/hudi/pull/9145#issuecomment-1631667354

## CI report:

* 10f9adc8c62d6fae7219a4472cba010a1c1c0da0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18472)
* 9e3c05bdf4ef8cb5bb800ccde85fde085b0d07af UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9131: [HUDI-6315] Feature flag for disabling optimized update/delete code path.
hudi-bot commented on PR #9131: URL: https://github.com/apache/hudi/pull/9131#issuecomment-1631667269

## CI report:

* e0622a6d0f2f7294ed079dd42cf5ff65a8718da3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18478)
* c6e27b201d5fe4ca6eff34f0efdd5a8da5a3a45d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8949: [DNM] Testing Java 17
hudi-bot commented on PR #8949: URL: https://github.com/apache/hudi/pull/8949#issuecomment-1631666998

## CI report:

* 021d75ea25b016c6482fec0838e14ce72e4c05cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18397)
* b5e4f56f3b7c905cfc782c997d586f717f93889c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631661084

## CI report:

* 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18502)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9165: [HUDI-6517] Throw error if deletion of invalid data file fails
nsivabalan commented on code in PR #9165: URL: https://github.com/apache/hudi/pull/9165#discussion_r1260403670

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:

@@ -688,7 +688,10 @@ private void deleteInvalidFilesByPartitions(HoodieEngineContext context, Map
[hudi] branch master updated: [HUDI-6521] Disable failing test case. (#9169)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:

new 51ddf1affcd [HUDI-6521] Disable failing test case. (#9169)

51ddf1affcd is described below

commit 51ddf1affcdead2e3b5e871ba4816c71e6f4b99a
Author: Amrish Lal
AuthorDate: Tue Jul 11 16:51:29 2023 -0700

[HUDI-6521] Disable failing test case. (#9169)

---
.../src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

```diff
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
index 7cd90145507..9508da7eaa2 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala
@@ -1191,7 +1191,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase {
     }
   }

-  test("Test Bulk Insert Into Bucket Index Table") {
+  /** Ignore failing test case (see HUDI-6521 for more details) */
+  ignore("Test Bulk Insert Into Bucket Index Table") {
     withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert", "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
       Seq("mor", "cow").foreach { tableType =>
         Seq("true", "false").foreach { bulkInsertAsRow =>
```
[GitHub] [hudi] nsivabalan merged pull request #9169: [HUDI-6521] Disable failing test case.
nsivabalan merged PR #9169: URL: https://github.com/apache/hudi/pull/9169
[GitHub] [hudi] stream2000 commented on pull request #9156: [HUDI-6515] Fix bucket index row writer write record to wrong handle
stream2000 commented on PR #9156: URL: https://github.com/apache/hudi/pull/9156#issuecomment-1631644198

> The test case modified in this PR is failing.

Sorry for the inconvenience, we have fixed the CI in #9163; could you please rebase onto the latest master?
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260391602

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:

```diff
@@ -217,6 +218,8 @@ public List close() {
     setupWriteStatus();

+    // createCompleteMarkerFile throws hoodieException, if marker directory is not present.
+    createCompletedMarkerFile(partitionPath, this.instantTime);
```

Review Comment: Done
[GitHub] [hudi] amrishlal commented on pull request #9169: [HUDI-6521] Disable failing test case.
amrishlal commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631635510

@hudi-bot run azure
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260389155

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

```diff
@@ -135,18 +155,42 @@ private String translateMarkerToDataPath(String markerPath) {
     return stripMarkerSuffix(rPath);
   }

+  public static String stripMarkerSuffix(String path) {
+    return path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN));
+  }
+
+  public static String stripOldStyleMarkerSuffix(String path) {
+    // marker file was created by older version of Hudi, with INPROGRESS_MARKER_EXTN (f1_w1_c1.marker).
+    // Rename to data file by replacing .marker with .parquet.
+    return String.format("%s%s", path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)),
+        HoodieFileFormat.PARQUET.getFileExtension());
+  }
+
   @Override
   public Set allMarkerFilePaths() throws IOException {
     Set markerFiles = new HashSet<>();
     if (doesMarkerDirExist()) {
       FSUtils.processFiles(fs, markerDirPath.toString(), fileStatus -> {
-        markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        // Only the inprogres markerFiles are to be included here
+        if (fileStatus.getPath().toString().contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)) {
+          markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        }
         return true;
       }, false);
     }
     return markerFiles;
   }

+  public boolean markerExists(Path markerPath) {
+    boolean objExists;
```

Review Comment: Done
[GitHub] [hudi] nbalajee commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-1631633479

> With respect to the partial files, when the feature flag is enabled, it will reduce the possibility of partial data files getting created.

Since second and subsequent attempts, in the presence of a completed marker, would return success (with the previously completed writeStatus), we are minimizing multiple tasks attempting to create files. Deleting the .hoodie/.temp/ marker folder after the finalizeWrite ensures that any new task attempting to create a data file (which would show up as a partially written file when the Job/JVM exits) will fail at the time of writeHandle creation if the marker folder is absent, and not create a partial file.
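The gating described here can be modeled in a few lines (a toy sketch with hypothetical names, not Hudi's marker API): writes are only allowed while the marker directory exists, so deleting it during finalize turns any straggler task into a failure instead of a partial file.

```python
import os
import tempfile

# Toy model of the described behavior: a handle may only create a data file
# while the marker directory exists; finalize deletes the marker dir, so any
# straggler task fails up front instead of leaving a partial data file behind.

def create_data_file(marker_dir: str, data_path: str) -> bool:
    if not os.path.isdir(marker_dir):
        raise RuntimeError("marker directory absent; write already finalized")
    with open(data_path, "w") as f:
        f.write("data")
    return True

base = tempfile.mkdtemp()
marker_dir = os.path.join(base, ".hoodie", ".temp", "commit1")
os.makedirs(marker_dir)

create_data_file(marker_dir, os.path.join(base, "f1.parquet"))  # succeeds
os.rmdir(marker_dir)  # finalize: marker folder removed
try:
    create_data_file(marker_dir, os.path.join(base, "f2.parquet"))
except RuntimeError as e:
    print("straggler blocked:", e)
```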
[GitHub] [hudi] hudi-bot commented on pull request #9173: Dynamic Partition Pruning Port
hudi-bot commented on PR #9173: URL: https://github.com/apache/hudi/pull/9173#issuecomment-1631627162

## CI report:

* 0d0d0cde8aa18c5e93ce256208ea3b27b210a815 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9169: [HUDI-6521] Disable failing test case.
hudi-bot commented on PR #9169: URL: https://github.com/apache/hudi/pull/9169#issuecomment-1631627132

## CI report:

* 7ccae5025f90656628abc0b75d452d4b3ad1dab6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18496)
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631627090

## CI report:

* 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499)
* 55648de1a087b2081dde0ef11ef1eb6944e18e0c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18501)
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631626905

## CI report:

* eb56e1be9ea831362a61adccec2ec2826c86d6a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18240)
* 2c07b3e13de51845aad4e280c5fb07688f103d4a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18500)
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260376601

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/TimelineServerBasedWriteMarkers.java:

```diff
@@ -86,6 +89,17 @@ public TimelineServerBasedWriteMarkers(HoodieTable table, String instantTime) {
     this.timeoutSecs = timeoutSecs;
   }

+  @Override
+  protected Path getMarkerPath(String partitionPath, String dataFileName, IOType type) {
+    return new Path(partitionPath, getMarkerFileName(dataFileName, type));
+  }
```

Review Comment: Removed.
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260376138

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

@@ -512,6 +512,7 @@ public List close() {
       status.getStat().setFileSizeInBytes(logFileSize);
     }

+    createCompletedMarkerFile(partitionPath, baseInstantTime);

Review Comment: Done. Added the check and changed the function to createCompletedMarkerFileIfRequired().
[GitHub] [hudi] nbalajee commented on a diff in pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
nbalajee commented on code in PR #9035: URL: https://github.com/apache/hudi/pull/9035#discussion_r1260374587

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

@@ -135,18 +155,42 @@ private String translateMarkerToDataPath(String markerPath) {
     return stripMarkerSuffix(rPath);
   }

+  public static String stripMarkerSuffix(String path) {
+    return path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN));
+  }
+
+  public static String stripOldStyleMarkerSuffix(String path) {
+    // marker file was created by an older version of Hudi, with INPROGRESS_MARKER_EXTN (f1_w1_c1.marker).
+    // Rename to data file by replacing .marker with .parquet.
+    return String.format("%s%s", path.substring(0, path.indexOf(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)),
+        HoodieFileFormat.PARQUET.getFileExtension());
+  }
+
   @Override
   public Set allMarkerFilePaths() throws IOException {
     Set markerFiles = new HashSet<>();
     if (doesMarkerDirExist()) {
       FSUtils.processFiles(fs, markerDirPath.toString(), fileStatus -> {
-        markerFiles.add(MarkerUtils.stripMarkerFolderPrefix(fileStatus.getPath().toString(), basePath, instantTime));
+        // Only the in-progress marker files are to be included here
+        if (fileStatus.getPath().toString().contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN)) {

Review Comment: explained above.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/marker/DirectWriteMarkers.java:

@@ -119,7 +139,7 @@ public Set createdAndMergedDataPaths(HoodieEngineContext context, int pa
       while (itr.hasNext()) {
         FileStatus status = itr.next();
         String pathStr = status.getPath().toString();
-        if (pathStr.contains(HoodieTableMetaClient.MARKER_EXTN) && !pathStr.endsWith(IOType.APPEND.name())) {
+        if (pathStr.contains(HoodieTableMetaClient.INPROGRESS_MARKER_EXTN) && !pathStr.endsWith(IOType.APPEND.name())) {
           result.add(translateMarkerToDataPath(pathStr));

Review Comment:
1. Every attempt to create a data file will have a corresponding in-progress marker file.
2. A successful attempt at creating a data file creates a completion marker (if the flag is enabled).
3. If a second or subsequent attempt is made to recreate the file and a completion marker exists when trying to create the write handle, then, if the flag is enabled, the old write status is returned and the attempt is considered successful.
4. While an ongoing write is in progress (file write not completed yet), if a second attempt is started, a new in-progress marker is created for the second/subsequent attempt. At the time of closing the write handle/file, the first process creates the completion marker. Second/subsequent attempts trying to close the write handle clean up their data file upon seeing the presence of a completion marker.
5. When reconciling data files, all in-progress markers are read and the list is pruned by removing entries that have a write status. Data files that have been created but don't have a corresponding write status are candidates for deletion during finalizeWrite.

During step 5, only in-progress markers are considered (completed markers are not, as the write status is used as the source of truth).
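As a hedged illustration of the reconciliation step described above, the pruning reduces to a set difference between the data paths implied by in-progress markers and the paths acknowledged in write statuses. Class and method names here are hypothetical, not Hudi's actual marker/WriteStatus APIs:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the final reconciliation: data paths implied by in-progress
// markers are pruned by the paths acknowledged in write statuses; whatever
// remains was created but never reported, and is a deletion candidate
// during finalizeWrite. Names are illustrative only.
public class MarkerReconciliationSketch {

  public static Set<String> orphanDataFiles(Set<String> inProgressMarkerDataPaths,
                                            Set<String> writeStatusDataPaths) {
    Set<String> candidates = new HashSet<>(inProgressMarkerDataPaths);
    // Write statuses are the source of truth: acknowledged files are kept.
    candidates.removeAll(writeStatusDataPaths);
    return candidates;
  }

  public static void main(String[] args) {
    Set<String> markers = Set.of("p1/f1.parquet", "p1/f2.parquet");
    Set<String> statuses = Set.of("p1/f1.parquet");
    // f2 was attempted but never acknowledged, so it is the orphan.
    System.out.println(orphanDataFiles(markers, statuses));
  }
}
```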
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:

@@ -138,9 +139,35 @@ protected Path makeNewFilePath(String partitionPath, String fileName) {
    *
    * @param partitionPath Partition path
    */
-  protected void createMarkerFile(String partitionPath, String dataFileName) {
-    WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime)
-        .create(partitionPath, dataFileName, getIOType(), config, fileId, hoodieTable.getMetaClient().getActiveTimeline());
+  protected void createInProgressMarkerFile(String partitionPath, String dataFileName, String markerInstantTime) {
+    WriteMarkers writeMarkers = WriteMarkersFactory.get(config.getMarkersType(), hoodieTable, instantTime);
+    if (!writeMarkers.doesMarkerDirExist()) {
+      throw new HoodieIOException(String.format("Marker root directory absent : %s/%s (%s)",
+          partitionPath, dataFileName, markerInstantTime));
+    }
+    if (config.enforceFinalizeWriteCheck()
+        && writeMarkers.markerExists(writeMarkers.getCompletionMarkerPath("", "FINALIZE_WRITE", markerInstantTime, IOType.CREATE))) {

Review Comment:
> In general, instead of an exact match of the state, I prefer the final consistency of the files diff we have on master.

+1 on performing final consistency of files based on the write statuses; this is our preferred approach. However, this change is trying to address extreme corner cases that could result in extra files left in the dataset's partition folder, that might appea
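The suffix-stripping helpers quoted earlier in this review can be exercised in isolation. The extension constants below are inlined assumptions for illustration; in Hudi they live on HoodieTableMetaClient and HoodieFileFormat:

```java
// Standalone sketch of the DirectWriteMarkers suffix helpers discussed above.
// The extension string values are assumptions inlined for illustration.
public class MarkerSuffixSketch {
  static final String INPROGRESS_MARKER_EXTN = ".marker"; // assumed value
  static final String PARQUET_EXTN = ".parquet";          // assumed value

  static String stripMarkerSuffix(String path) {
    // Drop everything from the marker extension onward.
    return path.substring(0, path.indexOf(INPROGRESS_MARKER_EXTN));
  }

  static String stripOldStyleMarkerSuffix(String path) {
    // Old-style markers (f1_w1_c1.marker) map to the data file by
    // replacing .marker with .parquet.
    return stripMarkerSuffix(path) + PARQUET_EXTN;
  }

  public static void main(String[] args) {
    System.out.println(stripMarkerSuffix("f1_w1_c1.parquet.marker.CREATE")); // f1_w1_c1.parquet
    System.out.println(stripOldStyleMarkerSuffix("f1_w1_c1.marker"));        // f1_w1_c1.parquet
  }
}
```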
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631621513 ## CI report: * 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499) * 55648de1a087b2081dde0ef11ef1eb6944e18e0c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631621309 ## CI report: * eb56e1be9ea831362a61adccec2ec2826c86d6a7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18240) * 2c07b3e13de51845aad4e280c5fb07688f103d4a UNKNOWN
[GitHub] [hudi] jonvex opened a new pull request, #9173: Dynamic Partition Pruning Port
jonvex opened a new pull request, #9173: URL: https://github.com/apache/hudi/pull/9173 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
hudi-bot commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631615497 ## CI report: * 5783dc533f3360e756c581544a53e6996add1653 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18494) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18495) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18499)
[GitHub] [hudi] hudi-bot commented on pull request #9035: [HUDI-6416] Completion markers for handling execution engine (spark) …
hudi-bot commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-1631615225 ## CI report: * 93088bc330d7c96229f24e40910e067e4e4caf36 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18497)
[GitHub] [hudi] prashantwason commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1631610969

@danny0405 @nsivabalan I think the cleaning strategy change for MDT is a bugfix because of the following enhancements:
1. The initial commit on the MDT will create HFiles.
2. Rollbacks now actually roll back the MDT instead of adding a -f1, -f2 deltacommit.

If we use KEEP_LATEST_COMMITS, then a wrong setting here would probably keep only a single HFile, and that will limit the rollback. We cannot roll back the MDT beyond the last HFile, as we would lose data.
[GitHub] [hudi] bkosuru opened a new issue, #9172: [SUPPORT] java.lang.ClassCastException with incremental query
bkosuru opened a new issue, #9172: URL: https://github.com/apache/hudi/issues/9172

Hudi version: hudi-spark3.3-bundle_2.12-0.13.1
Table type: COW
Env: GCP Dataproc batches V 1.1

I am getting the following exception with an incremental query:

23/07/11 22:27:50 ERROR TaskSetManager: Task 552 in stage 1.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 552 in stage 1.0 failed 4 times, most recent failure: Lost task 552.3 in stage 1.0 (TID 771) (10.12.0.121 executor 34): java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:514)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1(HashAggregateExec.scala:98)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$1$adapted(HashAggregateExec.scala:95)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2716)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2652)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2651)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2651)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1189)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1189)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2904)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2846)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2835)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.ClassCastException: class org.apache.spark.sql.vectorized.ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow (org.apache.spark.sql.vectorized.ColumnarBatch and org.apache.spark.sql.catalyst.InternalRow are in unnamed module of loader 'app')
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.
[GitHub] [hudi] sydneyhoran closed pull request #9171: Thescore changes
sydneyhoran closed pull request #9171: Thescore changes URL: https://github.com/apache/hudi/pull/9171
[GitHub] [hudi] sydneyhoran opened a new pull request, #9171: Thescore changes
sydneyhoran opened a new pull request, #9171: URL: https://github.com/apache/hudi/pull/9171 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] prashantwason commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1260360731

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

@@ -116,11 +134,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
             // Below config is only used if isLogCompactionEnabled is set.
             .withLogCompactionBlocksThreshold(writeConfig.getMetadataLogCompactBlocksThreshold())
             .build())
-        .withParallelism(parallelism, parallelism)
-        .withDeleteParallelism(parallelism)
-        .withRollbackParallelism(parallelism)
-        .withFinalizeWriteParallelism(parallelism)
-        .withAllowMultiWriteOnSameInstant(true)
+        .withStorageConfig(HoodieStorageConfig.newBuilder().hfileMaxFileSize(maxHFileSizeBytes)
+            .logFileMaxSize(maxLogFileSizeBytes).logFileDataBlockMaxSize(maxLogBlockSizeBytes).build())
+        .withRollbackParallelism(defaultParallelism)
+        .withFinalizeWriteParallelism(defaultParallelism)

Review Comment: Yes. This was required because the previous code for commits would overwrite the same instant if it already exists. With the already committed rollback PR, if we get a commit on the MDT with the same timestamp as a previously applied deltacommit, then we will first roll back the previously applied deltacommit on the MDT and then commit the new change. Hence, multi-write-on-same-instant is never possible in the MDT.
[GitHub] [hudi] prashantwason commented on a diff in pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
prashantwason commented on code in PR #9106: URL: https://github.com/apache/hudi/pull/9106#discussion_r1260359877

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -176,7 +176,7 @@ private void initMetadataReader() {
     }
     try {
-      this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath());
+      this.metadata = new HoodieBackedTableMetadata(engineContext, dataWriteConfig.getMetadataConfig(), dataWriteConfig.getBasePath(), true);

Review Comment: When initializing additional indexes (after the files partition is already initialized), the file listing is taken from the HoodieBackedTableMetadata itself. In this case, it is better to have reuse enabled so we don't keep listing each time an additional index is initialized. This is an optimization for the case when the FILES index exists and we are initializing one or more indexes.
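The reuse optimization described in the comment above can be sketched generically. This is a hypothetical illustration, not Hudi's HoodieBackedTableMetadata implementation: with reuse enabled, the expensive listing is computed once and served from a cache on later index initializations.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a "reuse" flag: cache the expensive file listing
// so that initializing several indexes lists the files only once.
public class ReusableListing {
  private final Supplier<List<String>> lister; // the expensive listing call
  private final boolean reuse;
  private List<String> cached;                 // lazily filled when reuse is on

  public ReusableListing(Supplier<List<String>> lister, boolean reuse) {
    this.lister = lister;
    this.reuse = reuse;
  }

  public List<String> listFiles() {
    if (!reuse) {
      return lister.get(); // reuse disabled: re-list on every call
    }
    if (cached == null) {
      cached = lister.get(); // first caller pays the listing cost
    }
    return cached;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    ReusableListing listing = new ReusableListing(() -> {
      calls[0]++;
      return List.of("f1.parquet");
    }, true);
    listing.listFiles();
    listing.listFiles();
    System.out.println("listings performed: " + calls[0]); // 1 with reuse enabled
  }
}
```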
[GitHub] [hudi] yihua commented on pull request #9168: [HUDI-6276] Rename HoodieDeltaStreamer to HoodieStreamer
yihua commented on PR #9168: URL: https://github.com/apache/hudi/pull/9168#issuecomment-1631606184 @hudi-bot run azure