[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454658647

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   * 10d3659dcc94bc069d0da83ee3b711bf4ff079fe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15573)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125415237


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -181,22 +181,24 @@ private boolean isLocalViewBehind(Context ctx) {
* Syncs data-set view if local view is behind.
*/
   private boolean syncIfLocalViewBehind(Context ctx) {
-if (isLocalViewBehind(ctx)) {
-  String basePath = 
ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-  String lastKnownInstantFromClient = 
ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, 
String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-  SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
-  synchronized (view) {
-if (isLocalViewBehind(ctx)) {
-  HoodieTimeline localTimeline = 
viewManager.getFileSystemView(basePath).getTimeline();
-  LOG.info("Syncing view as client passed last known instant " + 
lastKnownInstantFromClient
-  + " as last known instant but server has the following last 
instant on timeline :"
-  + localTimeline.lastInstant());
-  view.sync();
-  return true;
-}
+boolean result = false;
+String basePath = 
ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
+SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
+synchronized (view) {
+  if (isLocalViewBehind(ctx)) {
+
+String lastKnownInstantFromClient = ctx.queryParamAsClass(
+RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class)
+.getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
+HoodieTimeline localTimeline = 
viewManager.getFileSystemView(basePath).getTimeline();
+LOG.info("Syncing view as client passed last known instant " + 
lastKnownInstantFromClient
++ " as last known instant but server has the following last 
instant on timeline :"
++ localTimeline.lastInstant());
+view.sync();
+result = true;

Review Comment:
   Good catch! The variable is not needed.  Fixed now.
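   To make the pattern in this thread concrete, here is a minimal standalone sketch of double-checked view syncing with the direct return suggested by danny0405; all names are invented for illustration and this is not the merged Hudi code:

   ```java
   // Toy stand-in for the view-sync logic above: the staleness check is repeated
   // under the monitor so that concurrent requests trigger at most one sync, and
   // the method returns directly instead of tracking a local "result" flag.
   final class ViewSyncSketch {
     private long localInstant = 0;          // stand-in for the server's last instant

     boolean syncIfBehind(long clientInstant) {
       synchronized (this) {
         if (localInstant < clientInstant) { // re-check under the lock
           localInstant = clientInstant;     // stand-in for view.sync()
           return true;                      // return directly from the sync branch
         }
       }
       return false;
     }
   }
   ```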






[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454657002

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   * 10d3659dcc94bc069d0da83ee3b711bf4ff079fe UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125388754


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -282,6 +286,20 @@ public void reset() {
     }
   }
 
+  /**
+   * Resets the view states, which can be overridden by subclasses.  This reset logic is guarded
+   * by the write lock.
+   *
+   * NOTE: This method SHOULD BE OVERRIDDEN for any custom logic.  DO NOT OVERRIDE
+   * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
+   * to be served.
+   */
+  protected void runReset() {

Review Comment:
   Removed the `runSync` and `runReset` methods to avoid confusion and to make every implementation except the remote FSV use the write lock explicitly.  If a new file system view needs to be added, the author should look at the existing implementations for reference.  Renaming won't prevent the author from doing the wrong thing.
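
   A rough sketch of the "explicit write lock in each implementation" shape described above; the class and members are hypothetical stand-ins, not the actual Hudi classes:

   ```java
   import java.util.concurrent.locks.ReadWriteLock;
   import java.util.concurrent.locks.ReentrantReadWriteLock;

   // Hypothetical file system view: reset() takes the write lock itself rather
   // than relying on a template method in the base class.
   class ExplicitLockViewSketch {
     private final ReadWriteLock globalLock = new ReentrantReadWriteLock();

     public void reset() {
       globalLock.writeLock().lock();  // each implementation guards itself explicitly
       try {
         // clear caches and re-initialize from the latest timeline here
       } finally {
         globalLock.writeLock().unlock();
       }
     }
   }
   ```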






[GitHub] [hudi] danny0405 commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


danny0405 commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125407040


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -181,22 +181,24 @@ private boolean isLocalViewBehind(Context ctx) {
    * Syncs data-set view if local view is behind.
    */
   private boolean syncIfLocalViewBehind(Context ctx) {
-    if (isLocalViewBehind(ctx)) {
-      String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-      String lastKnownInstantFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-      SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
-      synchronized (view) {
-        if (isLocalViewBehind(ctx)) {
-          HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
-          LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
-              + " as last known instant but server has the following last instant on timeline :"
-              + localTimeline.lastInstant());
-          view.sync();
-          return true;
-        }
+    boolean result = false;
+    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
+    SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
+    synchronized (view) {
+      if (isLocalViewBehind(ctx)) {
+        String lastKnownInstantFromClient = ctx.queryParamAsClass(
+            RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class)
+            .getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
+        HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
+        LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
+            + " as last known instant but server has the following last instant on timeline :"
+            + localTimeline.lastInstant());
+        view.sync();
+        result = true;

Review Comment:
   Can we return directly from this line?






[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454572272

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * c162956f9f418b4603328c37f9e2babf59613d4b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
   * 7fff406e74cdf3faf047634a2d596399fa49f059 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15572)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454564278

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
   * c162956f9f418b4603328c37f9e2babf59613d4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
   * 7fff406e74cdf3faf047634a2d596399fa49f059 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454564173

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390671


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -269,19 +269,35 @@ public void close() {
 
   /**
    * Clears the partition Map and reset view states.
+   *
+   * NOTE: This method SHOULD NOT BE OVERRIDDEN which may cause stale file system view
+   * to be served.  Instead, override {@link AbstractTableFileSystemView#runReset} to
+   * add custom logic.
    */
   @Override
   public void reset() {
     try {
       writeLock.lock();
-      clear();
-      // Initialize with new Hoodie timeline.
-      init(metaClient, getTimeline());
+      runReset();
     } finally {
       writeLock.unlock();
     }
   }
 
+  /**
+   * Resets the view states, which can be overridden by subclasses.  This reset logic is guarded
+   * by the write lock.
+   *
+   * NOTE: This method SHOULD BE OVERRIDDEN for any custom logic.  DO NOT OVERRIDE

Review Comment:
   No longer needed.






[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390520


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -269,19 +269,35 @@ public void close() {
 
   /**
    * Clears the partition Map and reset view states.
+   *
+   * NOTE: This method SHOULD NOT BE OVERRIDDEN which may cause stale file system view
+   * to be served.  Instead, override {@link AbstractTableFileSystemView#runReset} to
+   * add custom logic.
    */
   @Override
   public void reset() {

Review Comment:
   no longer needed as discussed.  We directly use the write lock in each overriding implementation instead of indirect usage.



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java:
##
@@ -90,14 +90,14 @@ protected Map, FileStatus[]> 
listPartitions(List

[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390087


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -78,6 +79,10 @@ public class RequestHandler {
   private final BaseFileHandler dataFileHandler;
   private final MarkerHandler markerHandler;
   private final Registry metricsRegistry = Registry.getRegistry("TimelineService");
+  // This read-write lock is used for syncing the file system view if it is behind client's view

Review Comment:
   This read-write lock is removed now.






[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125389847


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -151,30 +156,38 @@ public void stop() {
    * Determines if local view of table's timeline is behind that of client's view.
    */
   private boolean isLocalViewBehind(Context ctx) {
-    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-    String lastKnownInstantFromClient =
-        ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-    String timelineHashFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, String.class).getOrDefault("");
-    HoodieTimeline localTimeline =
-        viewManager.getFileSystemView(basePath).getTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
-    if (LOG.isDebugEnabled()) {
-      LOG.debug("Client [ LastTs=" + lastKnownInstantFromClient + ", TimelineHash=" + timelineHashFromClient
-          + "], localTimeline=" + localTimeline.getInstants());
-    }
+    try {
+      // This read lock makes sure that if the local view of the table is being synced,
+      // no timeline server requests should be processed or handled until the sync process

Review Comment:
   @danny0405 This is simplified now.  You can also check my updated PR description for how the race condition can happen.






[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454556422

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
   * c162956f9f418b4603328c37f9e2babf59613d4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454556333

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   * e4c3c1dac4ae60c71219183167b491379f181ab0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15567)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yihua commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454555052

   > Is this a regression? what version? Can I look at the offending commit to understand how it was before.
   
   I updated the PR description to provide more detailed information.





[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


yihua commented on code in PR #8079:
URL: https://github.com/apache/hudi/pull/8079#discussion_r1125389071


##
hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:
##
@@ -187,12 +200,20 @@ private boolean syncIfLocalViewBehind(Context ctx) {
   SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
   synchronized (view) {
     if (isLocalViewBehind(ctx)) {
-      HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
-      LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
-          + " as last known instant but server has the following last instant on timeline :"
-          + localTimeline.lastInstant());
-      view.sync();
-      return true;
+      try {

Review Comment:
   As discussed offline, only keeping the synchronized block now.









[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


danny0405 commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454547592

   @hudi-bot run travis





[GitHub] [hudi] voonhous commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?

2023-03-03 Thread via GitHub


voonhous commented on issue #8071:
URL: https://github.com/apache/hudi/issues/8071#issuecomment-1454525599

   > Sorry for late reply, did you already use the append and it is still slow?
   
   Yeap, judging from the stack trace, he is running his job under append-only mode.
   
   ```log
   org.apache.hudi.sink.append.AppendWriteFunction.initWriterHelper(AppendWriteFunction.java:110
   ```
   
   > Then we switched to the snappy format, and the writing performance did improve to a certain extent. However, due to the Tencent Cloud COS we used for storage, there was a list frequency control problem in cow writing, so the overall performance could not be greatly improved, and the exception is as follows:
   
   This feels like a COS issue. @DavidZ1 you mentioned `there was a list frequency control problem in cow writing`. So, it's spending too much time listing files? IIUC, your job might be writing too many parquet files while flushing. I am not very familiar with COS, so I am taking a shot in the dark here: looking at your configurations, the default `write.parquet.max.file.size` is used, which is 120MB.
   
   Perhaps you could try increasing this so that fewer parquet files are written? Do note that your parquet files will get larger.




[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454506895

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
   * c162956f9f418b4603328c37f9e2babf59613d4b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454499436

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8076:
URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454490473

   
   ## CI report:
   
   * 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
   * 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
   * f384bbc843028360687903b3b6de835685235b68 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15570)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8076:
URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454442192

   
   ## CI report:
   
   * 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
   * 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
   * f384bbc843028360687903b3b6de835685235b68 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8076:
URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454433889

   
   ## CI report:
   
   * 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
   * 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch master updated (18d528f33d8 -> d40a6211f64)

2023-03-03 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


 from 18d528f33d8 [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)
  add d40a6211f64 [HUDI-5796] Adding auto inferring partition from incoming df (#7951)

No new revisions were added by this update.

Summary of changes:
 .../testsuite/dag/nodes/SparkDeleteNode.scala  |   2 +-
 .../dag/nodes/SparkDeletePartitionNode.scala   |   2 +-
 .../testsuite/dag/nodes/SparkInsertNode.scala  |   2 +-
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  45 
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |   2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala | 117 +++--
 6 files changed, 136 insertions(+), 34 deletions(-)



[GitHub] [hudi] nsivabalan merged pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


nsivabalan merged PR #7951:
URL: https://github.com/apache/hudi/pull/7951





[GitHub] [hudi] raghavant-git commented on issue #8016: Inline Clustering : Clustering failed to write to files

2023-03-03 Thread via GitHub


raghavant-git commented on issue #8016:
URL: https://github.com/apache/hudi/issues/8016#issuecomment-1454402165

   Thanks for the response; will test the above parameters and update it here.





[GitHub] [hudi] vinothchandar commented on a diff in pull request #7907: [HUDI-5672][RFC-61] Lockless multi writer support

2023-03-03 Thread via GitHub


vinothchandar commented on code in PR #7907:
URL: https://github.com/apache/hudi/pull/7907#discussion_r1122538301


##
rfc/rfc-61/rfc-61.md:
##
@@ -0,0 +1,98 @@
+# RFC-61: Lockless Multi Writer
+
+## Proposers
+- @danny0405
+- @ForwardXu
+- @SteNicholas
+
+## Approvers
+-
+
+## Status
+
+JIRA: [Lockless multi writer 
support](https://issues.apache.org/jira/browse/HUDI-5672)
+
+## Abstract
+As you know, Hudi already supports basic OCC with abundant lock providers.
+But for multi streaming ingestion writers, the OCC does not work well because 
the conflicts happen in very high frequency.
+Expand it a little bit, with hashing index, all the writers have deterministic 
hashing algorithm for distributing the records by primary keys,
+all the keys are evenly distributed in all the data buckets, for a single data 
flushing in one writer, almost all the data buckets are appended with new 
inputs,
+so the conflict would very possibility happen for mul-writer because almost 
all the data buckets are being written by multiple writers at the same time;
+For bloom filter index, things are different, but remember that we have a 
small file load rebalance strategy to writer into the **small** bucket in 
higher priority,
+that means, multiple writers prune to write into the same **small** buckets at 
the same time, that's how conflicts happen.
+
+In general, for multiple streaming writers ingestion, explicit lock is not 
very capable of putting into production, in this RFC, we propse a lockless 
solution for streaming ingestion.
+
+## Background
+
+Streaming jobs are naturally suitable for data ingestion, it has no complexity 
of pipeline orchestration and has a smother write workload.
+Most of the raw data set we are handling today are generating all the time in 
streaming way.
+
+Based on that, many requests for multiple writers' ingestion are derived. With 
multi-writer ingestion, several streaming events with the same schema can be 
drained into one Hudi table,
+the Hudi table kind of becomes a UNION table view for all the input data set. 
This is a very common use case because in reality, the data sets are usually 
scattered all over the data sources.
+
+Another very useful use case we wanna unlock is the real-time data set join. 
One of the biggest pain point in streaming computation is the dataset join,
+the engine like Flink has basic supports for all kind of SQL JOINs, but it 
stores the input records within its inner state-backend which is a huge cost 
for pure data join with no additional computations.
+In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced 
a `PartialUpdateAvroPayload`, in combination with the lockless multi-writer,
+we can implement N-ways data sources join in real-time! Hudi would take care 
of the payload join during compaction service procedure.
+
+## Design
+
+### The Precondition
+
+ MOR Table Type Is Required
+
+The table type must be `MERGE_ON_READ`, so that we can defer the conflict 
resolution to the compaction phase. The compaction service would resolve the 
conflicts of the same keys by respecting the event time sequence of the events.

Review Comment:
   compaction or merge on read for queries.
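
   As a toy illustration of the event-time-based resolution this precondition enables (types are invented; the real logic lives in the payload classes and the compaction service):

   ```java
   // Two writers may commit records for the same key in any order; merging by
   // event time makes the outcome independent of commit (arrival) order.
   final class EventTimeMergeSketch {
     record KeyedEvent(String key, long eventTime, String value) {}

     static KeyedEvent merge(KeyedEvent a, KeyedEvent b) {
       return a.eventTime() >= b.eventTime() ? a : b;  // keep the latest event
     }
   }
   ```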




[GitHub] [hudi] danny0405 commented on issue #8018: [SUPPORT] why is the schema evolution done while not setting hoodie.schema.on.read.enable

2023-03-03 Thread via GitHub


danny0405 commented on issue #8018:
URL: https://github.com/apache/hudi/issues/8018#issuecomment-1454372483

   I guess we need a clear doc to elaborate the schema evolution details for 
0.13.0





[jira] [Closed] (HUDI-5736) De-coupling column drop flag and schema validation flag in Flink

2023-03-03 Thread Danny Chen (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-5736.

Resolution: Fixed

Fixed via master branch: 18d528f33d8b1dd7a836e5543ddf36e0a9c95ad1

> De-coupling column drop flag and schema validation flag in Flink
> 
>
> Key: HUDI-5736
> URL: https://issues.apache.org/jira/browse/HUDI-5736
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink, writer-core
>Reporter: Alexander Trushev
>Assignee: Alexander Trushev
>Priority: Major
>  Labels: pull-request-available
>
> Fix https://issues.apache.org/jira/browse/HUDI-5704 for Flink engine



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)

2023-03-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 18d528f33d8 [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)
18d528f33d8 is described below

commit 18d528f33d8b1dd7a836e5543ddf36e0a9c95ad1
Author: Alexander Trushev 
AuthorDate: Sat Mar 4 11:00:25 2023 +0700

[HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)

* [HUDI-5736] Common de-coupling column drop flag and schema validation flag
---
 .../java/org/apache/hudi/table/HoodieTable.java| 40 ++---
 .../hudi/table/TestHoodieMergeOnReadTable.java |  1 +
 .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 65 ++
 .../org/apache/hudi/avro/TestAvroSchemaUtils.java  | 57 +++
 .../apache/hudi/sink/ITTestDataStreamWrite.java| 52 +
 .../resources/test_read_schema_dropped_age.avsc| 41 ++
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  4 +-
 .../AlterHoodieTableChangeColumnCommand.scala  |  2 +-
 .../hudi/command/MergeIntoHoodieTableCommand.scala |  3 +-
 9 files changed, 242 insertions(+), 23 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
index 2a71cf4ea46..8b1056bca6c 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.table;
 
+import org.apache.hudi.avro.AvroSchemaUtils;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.avro.model.HoodieCleanMetadata;
 import org.apache.hudi.avro.model.HoodieCleanerPlan;
@@ -92,6 +93,9 @@ import org.apache.log4j.Logger;
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
@@ -100,7 +104,6 @@ import java.util.function.Function;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
-import static org.apache.hudi.avro.AvroSchemaUtils.isSchemaCompatible;
 import static org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy.EAGER;
 import static org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy.LAZY;
 import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS;
@@ -803,27 +806,22 @@ public abstract class HoodieTable implements Serializable {
    */
   private void validateSchema() throws HoodieUpsertException, HoodieInsertException {
 
-    if (!shouldValidateAvroSchema() || getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) {
+    boolean shouldValidate = config.shouldValidateAvroSchema();
+    boolean allowProjection = config.shouldAllowAutoEvolutionColumnDrop();
+    if ((!shouldValidate && allowProjection)
+        || getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) {
       // Check not required
       return;
     }
 
-    Schema tableSchema;
-    Schema writerSchema;
-    boolean isValid;
     try {
       TableSchemaResolver schemaResolver = new TableSchemaResolver(getMetaClient());
-      writerSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema());
-      tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaResolver.getTableAvroSchema(false));
-      isValid = isSchemaCompatible(tableSchema, writerSchema, config.shouldAllowAutoEvolutionColumnDrop());
+      Schema writerSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema());
+      Schema tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaResolver.getTableAvroSchema(false));
+      AvroSchemaUtils.checkSchemaCompatible(tableSchema, writerSchema, shouldValidate, allowProjection, getDropPartitionColNames());
     } catch (Exception e) {
       throw new HoodieException("Failed to read schema/check compatibility for base path " + metaClient.getBasePath(), e);
     }
-
-    if (!isValid) {
-      throw new HoodieException("Failed schema compatibility check for writerSchema :" + writerSchema
-          + ", table schema :" + tableSchema + ", base path :" + metaClient.getBasePath());
-    }
   }
 
   public void validateUpsertSchema() throws HoodieUpsertException {
@@ -1041,11 +1039,15 @@ public abstract class HoodieTable implements Serializable {
     return Functions.noop();
   }
 
-  private boolean shouldValidateAvroSchema() {
-    // TODO(HUDI-4772) re-enable validations in case partition columns
-    // being dropped from the data-file after fixing the write schema
-
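
For context, a hedged sketch of exercising the new check with hand-built Avro schemas; the call shape is taken from the diff above, while the schemas and the empty drop-partition-column set are made up, and the expectation that it throws is an assumption:

```java
import java.util.Collections;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hudi.avro.AvroSchemaUtils;

public class SchemaCompatSketch {
  public static void main(String[] args) {
    Schema tableSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id").requiredInt("age").endRecord();
    Schema writerSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id").endRecord();  // writer drops the "age" column
    // shouldValidate = true, allowProjection = false: dropping a column
    // is presumably rejected by the compatibility check.
    AvroSchemaUtils.checkSchemaCompatible(
        tableSchema, writerSchema, true, false, Collections.emptySet());
  }
}
```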

[GitHub] [hudi] danny0405 merged pull request #7895: [HUDI-5736] Common de-coupling column drop flag and schema validation flag

2023-03-03 Thread via GitHub


danny0405 merged PR #7895:
URL: https://github.com/apache/hudi/pull/7895





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454366553

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   * e4c3c1dac4ae60c71219183167b491379f181ab0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15567)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454364773

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   * e4c3c1dac4ae60c71219183167b491379f181ab0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on pull request #7687: [HUDI-5606] Update to handle deletes in postgres debezium

2023-03-03 Thread via GitHub


danny0405 commented on PR #7687:
URL: https://github.com/apache/hudi/pull/7687#issuecomment-1454364001

   Reviewing now, can you add some test cases for the payload thing?





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454363083

   
   ## CI report:
   
   * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15561)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454363052

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc

2023-03-03 Thread via GitHub


danny0405 commented on issue #8092:
URL: https://github.com/apache/hudi/issues/8092#issuecomment-1454362530

   Thanks, can you fire a PR to asf-site branch and fix that?





[GitHub] [hudi] danny0405 closed issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc

2023-03-03 Thread via GitHub


danny0405 closed issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc
URL: https://github.com/apache/hudi/issues/8092





[GitHub] [hudi] danny0405 commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?

2023-03-03 Thread via GitHub


danny0405 commented on issue #8071:
URL: https://github.com/apache/hudi/issues/8071#issuecomment-1454362050

   Sorry for late reply, did you already use the append and it is still slow?





[jira] [Resolved] (HUDI-5812) Optimize the data size check in HoodieBaseParquetWriter

2023-03-03 Thread Danny Chen (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen resolved HUDI-5812.
--

> Optimize the data size check in HoodieBaseParquetWriter
> ---
>
> Key: HUDI-5812
> URL: https://issues.apache.org/jira/browse/HUDI-5812
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-5812) Optimize the data size check in HoodieBaseParquetWriter

2023-03-03 Thread Danny Chen (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-5812.

Fix Version/s: 0.13.1
   0.14.0
   Resolution: Fixed

Fixed via master branch: 2a52bc03d90d88c518d5ab377dc01e717813522b

> Optimize the data size check in HoodieBaseParquetWriter
> ---
>
> Key: HUDI-5812
> URL: https://issues.apache.org/jira/browse/HUDI-5812
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Assignee: Hui An
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.1, 0.14.0
>
>






[hudi] branch master updated: [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978)

2023-03-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a52bc03d90 [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978)
2a52bc03d90 is described below

commit 2a52bc03d90d88c518d5ab377dc01e717813522b
Author: Rex(Hui) An 
AuthorDate: Sat Mar 4 11:36:06 2023 +0800

[HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978)

Use an exponentially elastic algorithm to probe the .canWrite flag.
---
 .../hudi/io/storage/HoodieBaseParquetWriter.java   |  38 +--
 .../io/storage/TestHoodieBaseParquetWriter.java| 122 +
 2 files changed, 150 insertions(+), 10 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java
index e38b41d422a..a82c26bae92 100644
--- a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java
+++ b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java
@@ -21,6 +21,7 @@ package org.apache.hudi.io.storage;
 import org.apache.hadoop.fs.Path;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.hudi.common.util.VisibleForTesting;
 import org.apache.parquet.hadoop.ParquetFileWriter;
 import org.apache.parquet.hadoop.ParquetWriter;
 import org.apache.parquet.hadoop.api.WriteSupport;
@@ -28,6 +29,9 @@ import org.apache.parquet.hadoop.api.WriteSupport;
 import java.io.IOException;
 import java.util.concurrent.atomic.AtomicLong;
 
+import static org.apache.parquet.column.ParquetProperties.DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK;
+import static org.apache.parquet.column.ParquetProperties.DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK;
+
 /**
  * Base class of Hudi's custom {@link ParquetWriter} implementations
  *
@@ -36,11 +40,9 @@ import java.util.concurrent.atomic.AtomicLong;
  */
 public abstract class HoodieBaseParquetWriter<R> extends ParquetWriter<R> {
 
-  private static final int WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK = 1000;
-
   private final AtomicLong writtenRecordCount = new AtomicLong(0);
   private final long maxFileSize;
-  private long lastCachedDataSize = -1;
+  private long recordCountForNextSizeCheck;
 
   public HoodieBaseParquetWriter(Path file,
                                  HoodieParquetConfig<? extends WriteSupport<R>> parquetConfig) throws IOException {
@@ -62,17 +64,28 @@ public abstract class HoodieBaseParquetWriter<R> extends ParquetWriter<R> {
     // stream and the actual file size reported by HDFS
     this.maxFileSize = parquetConfig.getMaxFileSize()
         + Math.round(parquetConfig.getMaxFileSize() * parquetConfig.getCompressionRatio());
+    this.recordCountForNextSizeCheck = DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK;
   }
 
   public boolean canWrite() {
-    // TODO we can actually do evaluation more accurately:
-    //  if we cache last data size check, since we account for how many records
-    //  were written we can accurately project avg record size, and therefore
-    //  estimate how many more records we can write before cut off
-    if (lastCachedDataSize == -1 || getWrittenRecordCount() % WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK == 0) {
-      lastCachedDataSize = getDataSize();
+    long writtenCount = getWrittenRecordCount();
+    if (writtenCount >= recordCountForNextSizeCheck) {
+      long dataSize = getDataSize();
+      // In some very extreme cases, like all records are same value, then it's possible
+      // the dataSize is much lower than the writtenRecordCount (high compression ratio),
+      // causing avgRecordSize to 0, we'll force the avgRecordSize to 1 for such cases.
+      long avgRecordSize = Math.max(dataSize / writtenCount, 1);
+      // Follow the parquet block size check logic here, return false
+      // if it is within ~2 records of the limit
+      if (dataSize > (maxFileSize - avgRecordSize * 2)) {
+        return false;
+      }
+      recordCountForNextSizeCheck = writtenCount + Math.min(
+          // Do check it in the halfway
+          Math.max(DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK, (maxFileSize / avgRecordSize - writtenCount) / 2),
+          DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK);
     }
-    return lastCachedDataSize < maxFileSize;
+    return true;
   }
 
   @Override
@@ -84,4 +97,9 @@ public abstract class HoodieBaseParquetWriter<R> extends ParquetWriter<R> {
   protected long getWrittenRecordCount() {
     return writtenRecordCount.get();
   }
+
+  @VisibleForTesting
+  protected long getRecordCountForNextSizeCheck() {
+    return recordCountForNextSizeCheck;
+  }
 }
diff --git a/hudi-common/src/test/java/org/apache/hudi/io/storage/TestHoodieBaseParquetWriter.java
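
For intuition, a minimal standalone sketch (Scala; not part of the commit) of the
probe schedule introduced above. The step bounds of 100 and 10000 records are
assumptions mirroring Parquet's DEFAULT_MINIMUM/MAXIMUM_RECORD_COUNT_FOR_CHECK.

```scala
object SizeCheckSketch {
  val MinCheck = 100L    // assumed DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK
  val MaxCheck = 10000L  // assumed DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK

  // Record count at which the next (comparatively expensive) getDataSize()
  // probe should run: halfway to the projected size limit, with the step
  // clamped to [MinCheck, MaxCheck].
  def nextCheck(writtenCount: Long, dataSize: Long, maxFileSize: Long): Long = {
    val avgRecordSize = math.max(dataSize / writtenCount, 1L) // guard against 0
    writtenCount + math.min(
      math.max(MinCheck, (maxFileSize / avgRecordSize - writtenCount) / 2),
      MaxCheck)
  }

  def main(args: Array[String]): Unit = {
    // 1M records and ~100 MB written against a 120 MB budget: half the
    // remaining headroom is ~105k records, clamped to a 10k step, so the
    // next probe runs at 1,010,000 records.
    println(nextCheck(1000000L, 100L * 1024 * 1024, 120L * 1024 * 1024))
  }
}
```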
 

[GitHub] [hudi] danny0405 merged pull request #7978: [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter

2023-03-03 Thread via GitHub


danny0405 merged PR #7978:
URL: https://github.com/apache/hudi/pull/7978





[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


danny0405 commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454358726

   @hudi-bot run azure





[GitHub] [hudi] danny0405 commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-03 Thread via GitHub


danny0405 commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1454358368

   > > @yuzhaojing @xushiyan The changes to the write client are done when 
introducing the new table service client. Before that, based on my 
understanding, the inline table services running along with the regular write 
client share the same timeline server. So I think with the new table service 
client, we should still follow the same convention. Is there anything I miss? 
When the table service manager is used, how's the interplay between the 
timeline server and the table service manager?
   > > cc @nsivabalan
   > > Before we fully agree on the approach here, let's not merge this PR. 
Also, I'd like to add some tests to guard around the expected behavior, after 
the discussion.
   > 
   > @yihua @danny0405 @xushiyan I'm sorry for this serious bug. I think the 
table service client should share the same timeline server as the regular write 
client. Here I think the following tests can be added to the table service 
client:
   > 
   > 1. Add unit tests to confirm that the table service client has not made 
unexpected modifications to writeConfig.
   > 2. Confirm that the table service of the table service client is scheduled 
and executed normally.
   > 3. Confirm that calls work correctly after the table service manager is started.
   > 
   > I want to hear your thoughts, and I apologize again!
   
   Yeah, we need some basic UT for the service client.





[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454346850

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * a3062bb83dc4bdbdc39bb3ff4a5c612b2cb5401d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15514)
 
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8079:
URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454344932

   
   ## CI report:
   
   * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
   * a3062bb83dc4bdbdc39bb3ff4a5c612b2cb5401d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15514)
 
   * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on issue #7836: [Q] get history of a given record?

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7836:
URL: https://github.com/apache/hudi/issues/7836#issuecomment-1454337243

   hey @meeting90: if your question is resolved, can you close out the issue?
If not, let us know how else we can help.
   





[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336739

   As far as trimming down the number of files goes, we don't have any automatic
support as of now, but we will be working on it.
   If you are interested in working on it, let us know; we can guide you.
   





[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7800:
URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336475

   hey @phani482,
   sorry for the late turnaround.
   Have you enabled meta sync by any chance? Recently we found an issue where
meta sync loads the archived timeline unnecessarily:
   
   https://github.com/apache/hudi/pull/7561
   
   If you can try with 0.13.0 and let us know what you see, that would be nice;
or you can cherry-pick this commit into your internal fork if you have one.
   





[GitHub] [hudi] nsivabalan commented on issue #7829: [SUPPORT] Using monotonically_increasing_id to generate record key causing duplicates on upsert

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7829:
URL: https://github.com/apache/hudi/issues/7829#issuecomment-1454335343

   I might know why this could be happening.
   If you can clarify something, we can confirm.
   
   For a given df, while generating the primary key using a monotonically
increasing function, calling the key generation twice could return different
keys, right? Spark only ensures they are unique within one evaluation; it does
not guarantee they are the same across evaluations.
   
   Down the line, our upsert partitioner is based on the hash of the record
key. So, for one of the Spark partitions, if the Spark DAG is re-triggered,
chances are that a re-attempt of primary key generation produces a new set of
keys whose hash values differ from the first time, and you might see
duplicates or data loss.
   





[GitHub] [hudi] nsivabalan commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7897:
URL: https://github.com/apache/hudi/issues/7897#issuecomment-1454333905

   hey @menna224:
   let me clarify something first, and then I will ask for some clarification.
   
   Commit1:
   key1, val1: file1_v1.parquet
   
   Commit2:
   key2, val2: file1_v2.parquet
   
   Both file1_v1 and file1_v2 belong to the same file group. When you run a read
query, Hudi will only read file1_v2.parquet; this is due to small file
handling. The cleaner, when it gets executed later, will clean up
file1_v1.parquet, but once file1_v2.parquet is created, none of your snapshot
queries will read from file1_v1.
   
   Commit3:
   key3, val3: again due to small file handling, file1_v3.parquet.
   
   Commit4:
   key3, val4 (same key as before, but an update)
   Hudi will add a log file to file1 (the file group).
   
   So, on disk,
   it is file1_v3.parquet plus a log file for file1.
   
   With rt, Hudi will read both of them, merge, and serve.
   In the case of ro, Hudi will read just file1_v3.parquet.
   
   Let's say we keep adding more updates for key3; more log files will be added.
   Once compaction kicks in, a new parquet file will be created:
   file1_v4.parquet (which is a merged version of file1_v3 + all associated log
files).
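   
   For reference, a hedged spark-shell sketch of the two read paths (the table
path is illustrative):
   
   ```scala
   val basePath = "/tmp/hudi_tbl"
   
   // Snapshot ("rt") view: merges the latest base file with its log files.
   val rt = spark.read.format("hudi").
     option("hoodie.datasource.query.type", "snapshot").
     load(basePath)
   
   // Read-optimized ("ro") view: reads base files only, skipping log files.
   val ro = spark.read.format("hudi").
     option("hoodie.datasource.query.type", "read_optimized").
     load(basePath)
   
   rt.show(false)  // sees key3 -> val4 from the log file
   ro.show(false)  // still key3 -> val3 until compaction rewrites the base file
   ```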
   
   Can you clarify what issue you are seeing? Your example wasn't very clear to
me, especially these statements:
   ```
   then after the 10th update where i changed the name to "joe", I can see 10 
log files, and only 1 parquet file, the parquet file that is kept is the last 
one (file3.parquet) with the old values not the updates ones:
   (id=3,name=mg)
   (id=4,name=sa)
   (id=5,name=john)
   
   and file1.parquet  were delted.
   rt table contained the right values (the three records and the last record 
has a value joe for the coloum name)
   ro contained the values that's in the parquet
   ```
   
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454306233

   
   ## CI report:
   
   * d5333e95b609d585c00404c55151830108dd160c Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560)
 
   * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15561)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454303245

   
   ## CI report:
   
   * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559)
 
   * d5333e95b609d585c00404c55151830108dd160c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560)
 
   * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454240600

   
   ## CI report:
   
   * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559)
 
   * d5333e95b609d585c00404c55151830108dd160c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454222596

   
   ## CI report:
   
   * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559)
 
   * d5333e95b609d585c00404c55151830108dd160c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454180148

   
   ## CI report:
   
   * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454173948

   
   ## CI report:
   
   * 61dda6da1e111009d968f3af1735f56b43181be7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15542)
 
   * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454173661

   
   ## CI report:
   
   * 9ae7b06b3f38d34875349f98d5e64390ab6d60db Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15558)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454064985

   
   ## CI report:
   
   * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   * 9ae7b06b3f38d34875349f98d5e64390ab6d60db Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15558)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454056652

   
   ## CI report:
   
   * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   * 9ae7b06b3f38d34875349f98d5e64390ab6d60db UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Commented] (HUDI-5840) [DOCS] Add spark procedures do docs

2023-03-03 Thread kazdy (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696292#comment-17696292
 ] 

kazdy commented on HUDI-5840:
-

closing as there's a PR open for it already:
https://github.com/apache/hudi/pull/8004

> [DOCS] Add spark procedures do docs
> ---
>
> Key: HUDI-5840
> URL: https://issues.apache.org/jira/browse/HUDI-5840
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: kazdy
>Assignee: kazdy
>Priority: Minor
>
> Add Spark procedures to the docs; most are missing





[jira] [Closed] (HUDI-5840) [DOCS] Add spark procedures do docs

2023-03-03 Thread kazdy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kazdy closed HUDI-5840.
---
Resolution: Duplicate

> [DOCS] Add spark procedures do docs
> ---
>
> Key: HUDI-5840
> URL: https://issues.apache.org/jira/browse/HUDI-5840
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: kazdy
>Assignee: kazdy
>Priority: Minor
>
> Add Spark procedures to the docs; most are missing





[GitHub] [hudi] nsivabalan commented on issue #7906: [SUPPORT] compaction error - Avro field '_hoodie_operation' not found

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7906:
URL: https://github.com/apache/hudi/issues/7906#issuecomment-1454047912

   @danny0405 @bhasudha: do we need an FAQ or troubleshooting guide entry
around this?
   





[GitHub] [hudi] nsivabalan commented on issue #7909: Failed to create Marker file

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7909:
URL: https://github.com/apache/hudi/issues/7909#issuecomment-1454047432

   @koochiswathiTR: any updates on this end? If the issue got resolved, can
you please close it.





[GitHub] [hudi] nsivabalan commented on issue #7910: [SUPPORT]

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7910:
URL: https://github.com/apache/hudi/issues/7910#issuecomment-1454046650

   You can also check 
https://medium.com/@simpsons/apache-hudis-small-file-management-17d8c61b20e6 
for reference. 
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454046521

   
   ## CI report:
   
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on issue #7910: [SUPPORT]

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7910:
URL: https://github.com/apache/hudi/issues/7910#issuecomment-1454046187

   Is it a COW or MOR table?
   COW:
   If you look at S3 directly, you might find older files too. After rewriting a
newer version of the base file, Hudi will not delete the older file
immediately; the cleaner will take care of it. But your queries/readers will
only read the latest version of the data file.
   
   With a MOR table, it's more nuanced.
   By default, only one file group (without any log files) is considered for
small file bin packing.
   If you wish more files to be picked up, you can try tweaking
https://hudi.apache.org/docs/configurations/#hoodiemergesmallfilegroupcandidateslimit
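   
   A hedged sketch of bumping that knob on the write path (df is your input
DataFrame; the value and path are illustrative, and the default is 1):
   
   ```scala
   df.write.format("hudi").
     option("hoodie.table.name", "hudi_tbl").
     option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
     // consider up to 10 small file groups as bin-packing candidates
     option("hoodie.merge.small.file.group.candidates.limit", "10").
     mode("append").
     save("/tmp/hudi_tbl")
   ```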
 
   
   
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454046221

   
   ## CI report:
   
   * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] GallonREX commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-03-03 Thread via GitHub


GallonREX commented on issue #7925:
URL: https://github.com/apache/hudi/issues/7925#issuecomment-1454041406

   This is an automated reply. Thank you for your email; I have received it and will reply to you as soon as possible.





[GitHub] [hudi] nsivabalan commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7925:
URL: https://github.com/apache/hudi/issues/7925#issuecomment-1454040912

   Generally, multi-writer capability means both writers can write concurrently
only if they don't have overlapping data being ingested,
   e.g., if the two are ingesting to two completely different partitions. If
not, Hudi may not be able to resolve the winner and hence will abort/fail one
of the writers.
   
   That is expected.
   
   Can you clarify whether the two writers are writing non-overlapping data and
still run into the concurrent modification exception?
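   
   For context, a hedged sketch of the settings multi-writing relies on (df is
your input DataFrame; the lock provider choice and ZooKeeper endpoints are
illustrative):
   
   ```scala
   df.write.format("hudi").
     option("hoodie.table.name", "hudi_tbl").
     // conflicts are detected at commit time; the losing writer is aborted,
     // which surfaces as the exception above
     option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
     option("hoodie.cleaner.policy.failed.writes", "LAZY").
     option("hoodie.write.lock.provider",
       "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
     option("hoodie.write.lock.zookeeper.url", "zk1").
     option("hoodie.write.lock.zookeeper.port", "2181").
     option("hoodie.write.lock.zookeeper.lock_key", "hudi_tbl").
     option("hoodie.write.lock.zookeeper.base_path", "/hudi_locks").
     mode("append").
     save("/tmp/hudi_tbl")
   ```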
   





[GitHub] [hudi] nsivabalan commented on issue #7960: [SUPPORT]

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7960:
URL: https://github.com/apache/hudi/issues/7960#issuecomment-1454031910

   Yeah, you need to set `--source-ordering-field` as well; it is equivalent to
the preCombine field if you were to ingest via the Spark datasource writer.
   





[GitHub] [hudi] nsivabalan commented on issue #7990: [SUPPORT]Is It possible to update hudi table with data that having fewer columns

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7990:
URL: https://github.com/apache/hudi/issues/7990#issuecomment-145400

   thanks! 
   





[GitHub] [hudi] nsivabalan closed issue #7990: [SUPPORT]Is It possible to update hudi table with data that having fewer columns

2023-03-03 Thread via GitHub


nsivabalan closed issue #7990: [SUPPORT]Is It possible to update hudi table 
with data that having fewer columns
URL: https://github.com/apache/hudi/issues/7990





[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-1454007979

   Can you clarify something: what exactly is your Hudi table base path?
   `/data/testfolder`
   Is it `data` or is it `/data/testfolder`?
   Hudi will not do any list operations on the parent of the Hudi table base path.
   But if you have other non-Hudi folders within the Hudi table base path, it
could try to list those folders.
   It also depends on whether you have the metadata table enabled or not. If you
can clarify what the base path is and which directories your findings show the
high number of LIST calls against, we can go from there.
   
   





[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7991:
URL: https://github.com/apache/hudi/issues/7991#issuecomment-145340

   We fixed an issue with Hive sync loading the archived timeline unnecessarily:
   https://github.com/apache/hudi/pull/7561
   With 0.13.0, it should not be the case anymore.
   
   





[GitHub] [hudi] nsivabalan commented on issue #7996: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieIOException: IOException when reading logblock from log file HoodieLogFile{pathStr='s3://dataho

2023-03-03 Thread via GitHub


nsivabalan commented on issue #7996:
URL: https://github.com/apache/hudi/issues/7996#issuecomment-1453996107

   Actually, we fixed something on this front recently:
   https://github.com/apache/hudi/pull/7561
   
   Can you try 0.13.0? We expect it to be fixed there. Or you can pull this
patch into your internal fork if you maintain one.
   





[GitHub] [hudi] nsivabalan commented on issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8036:
URL: https://github.com/apache/hudi/issues/8036#issuecomment-1453987238

   good question.
   
   Depending on which SQL tool you use, you can explore how to select all
columns except a few; then you can exclude the hoodie meta columns explicitly
in your INSERT INTO statement.
   
   
   For example, for Spark SQL, you can do the following:
   
   spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
   
   #select all columns except a,b
   sql("select `(a|b)?+.+` from tmp").show()
   #+---+---+
   #| id|  c|
   #+---+---+
   #|  1|  4|
   #+---+---+
   
   Ref: 
https://stackoverflow.com/questions/63127263/how-to-select-all-columns-except-2-of-them-from-a-large-table-on-pyspark-sql
   
   Hive: 
https://stackoverflow.com/questions/51227890/hive-how-to-select-all-but-one-column
   





[jira] [Created] (HUDI-5875) Fix index look/matching records for MERGE INTO with MOR table

2023-03-03 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-5875:
-

 Summary: Fix index look/matching records for MERGE INTO with MOR 
table
 Key: HUDI-5875
 URL: https://issues.apache.org/jira/browse/HUDI-5875
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: sivabalan narayanan


A MERGE INTO statement on a MOR table might go wrong in some corner cases.

 

Consider a record that is valid as per the base file but has received a delete in the log files.

Following this, if a user executes the MERGE INTO statement below
{code:java}
merge into hudi_table2 using (select * from source) as b on (hudi_table2.id = 
b.id and hudi_table2.name=b.name) when not matched then insert *; {code}
then a record that was deleted in a log file might appear, per our index
lookup, as though it is a valid record.

 

This will not be an issue with a COW table, or after compaction kicks in.





[GitHub] [hudi] nsivabalan commented on issue #8034: [SUPPORT]merge into didn`t reinsert the delete record

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8034:
URL: https://github.com/apache/hudi/issues/8034#issuecomment-1453971798

   Created a ticket, https://issues.apache.org/jira/browse/HUDI-5875, to follow
up.
   This will not be an issue with a COW table, or after compaction kicks in for
the file group of interest.





[GitHub] [hudi] nsivabalan commented on issue #8034: [SUPPORT]merge into didn`t reinsert the delete record

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8034:
URL: https://github.com/apache/hudi/issues/8034#issuecomment-1453965150

   I can explain what's happening under the hood;
   I'm not sure yet how we can fix it properly. We might need to think it through.
   
   After step 8 above, the delete of id=1 goes into a log file in hudi_table2. So
if you do a snapshot read from table2, you will not see the id=1 record. But an
index lookup might show id=1 as still belonging to hudi_table2 until compaction
kicks in. So, during step 9, the MERGE INTO results in an index lookup (when
not matched), both id=1 and id=2 are seen as valid records from hudi_table2,
and so it does not re-insert anything.
   
   
   
   





[GitHub] [hudi] nsivabalan commented on issue #8031: [SUPPORT] Hudi Timestamp Based Key Generator Need Assistance

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8031:
URL: https://github.com/apache/hudi/issues/8031#issuecomment-1453950617

   ```
   
   import java.sql.Timestamp
   import spark.implicits._
   
   val df = Seq(
 (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
 (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
 (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
 (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
   ).toDF("typeId","eventTime", "str")
   
   
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.common.model.HoodieRecord
   
   
   
   df.write.format("hudi").
   option("hoodie.insert.shuffle.parallelism", "2").
   option("hoodie.upsert.shuffle.parallelism", "2").
 option("hoodie.datasource.write.precombine.field", "typeId").
 option("hoodie.datasource.write.partitionpath.field", "eventTime").
 option("hoodie.datasource.write.recordkey.field", "str").
 
option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
 
option("hoodie.deltastreamer.keygen.timebased.timestamp.type","DATE_STRING").
 option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
 
option("hoodie.deltastreamer.keygen.timebased.input.dateformat","-MM-dd 
hh:mm:ss").
 
option("hoodie.deltastreamer.keygen.timebased.output.dateformat","-MM-dd").
 
option("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled","true").
 option("hoodie.table.name", "hudi_tbl").
 mode(Overwrite).
 save("/tmp/hudi_tbl_trial/")
   
   ```
   
   ls of base path 
   ```
   ls -ltr /tmp/hudi_tbl_trial/
   total 0
   drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2016-12-30
   drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2014-01-02
   drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2016-05-10
   drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2014-12-01
   ```
   
   
   
   If you prefer slash encoded 
   ```
   option("hoodie.deltastreamer.keygen.timebased.output.dateformat","yyyy/MM/dd")
   ```
   
   but the directory will be 3 levels deep
   ```
   ls -ltr /tmp/hudi_tbl_trial/
   total 0
   drwxr-xr-x  4 nsb  wheel  128 Mar  3 10:42 2014
   drwxr-xr-x  4 nsb  wheel  128 Mar  3 10:42 2016
   nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/
   total 0
   drwxr-xr-x  3 nsb  wheel  96 Mar  3 10:42 01
   drwxr-xr-x  3 nsb  wheel  96 Mar  3 10:42 12
   nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/01/
   total 0
   drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:42 02
   nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/01/02/
   total 856
   -rw-r--r--  1 nsb  wheel  434759 Mar  3 10:42 b02e5e6f-9d28-42d1-b257-3728e534d477-0_3-49-76_20230303104246958.parquet
   ```
   
   
   Guess you were missing
   option("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled","true").
   
   https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled-1
   
   
   
   





[GitHub] [hudi] nsivabalan closed issue #8021: [DOC] Adding new stored procedures and brief documentation to the Hudi website

2023-03-03 Thread via GitHub


nsivabalan closed issue #8021: [DOC] Adding new stored procedures and brief 
documentation to the Hudi website
URL: https://github.com/apache/hudi/issues/8021





[GitHub] [hudi] nsivabalan commented on issue #8021: [DOC] Adding new stored procedures and brief documentation to the Hudi website

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8021:
URL: https://github.com/apache/hudi/issues/8021#issuecomment-1453926176

   Sure @kazdy, that would be really great. Do you think you can add examples
when you put one up? That would definitely benefit the community.
   





[GitHub] [hudi] nsivabalan commented on issue #8025: Found commits after time :20230220161017756, please rollback greater commits first

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8025:
URL: https://github.com/apache/hudi/issues/8025#issuecomment-1453925203

   We also made a fix for rolling back a completed instant:
https://github.com/apache/hudi/pull/6313. Can you try 0.12.1, maybe?
   





[GitHub] [hudi] nsivabalan commented on issue #8025: Found commits after time :20230220161017756, please rollback greater commits first

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8025:
URL: https://github.com/apache/hudi/issues/8025#issuecomment-1453922536

   Can you post the contents of ".hoodie" with last modification times intact
(ls -ltr)?
   Also, when you triggered the rollback via the CLI, what was the entire
command you passed?
   
   I see we have an option `--rollbackUsingMarkers`. Did you set it or not?
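   
   For reference, a hedged hudi-cli sketch (the base path is a placeholder; the
`commit rollback` command and its `--rollbackUsingMarkers` option are assumed
from the CLI's commit commands):
   
   ```
   connect --path s3://bucket/path/to/table
   commit rollback --commit 20230220161017756 --rollbackUsingMarkers false
   ```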
   
   
   





[GitHub] [hudi] nsivabalan commented on issue #8016: Inline Clustering : Clustering failed to write to files

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8016:
URL: https://github.com/apache/hudi/issues/8016#issuecomment-1453917024

   Please check out these properties.
   
   Max num groups:
   
   hoodie.clustering.plan.strategy.max.num.groups: Maximum number of groups to
create as part of a ClusteringPlan. Increasing groups will increase
parallelism. This does not directly set the number of output file groups; it
refers to clustering groups (parallel tasks/threads that will work towards
producing output file groups). The total number of output file groups is also
determined by the target file size, which we will discuss shortly.
   
   Max bytes per group:
   
   hoodie.clustering.plan.strategy.max.bytes.per.group: Each clustering
operation can create multiple output file groups. The total amount of data
processed by one clustering operation is bounded by the product of the two
properties (max bytes per group * max num groups); this config caps the
amount of data included in one group.
   
   Target file size max:
   
   hoodie.clustering.plan.strategy.target.file.max.bytes: Each group can
produce 'N' (max group size / target file size) output file groups.
   
   
   These might help trim down the amount of data considered for clustering;
maybe we are trying to cluster too many files at the same time. A sketch
wiring them together follows the reference below.
   
   Reference: 
https://medium.com/@simpsons/storage-optimization-with-apache-hudi-clustering-aa6e23e18e77
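   
   A hedged sketch of capping inline clustering with the three knobs above (df
is your input DataFrame; the sizes are illustrative, not recommendations):
   
   ```scala
   df.write.format("hudi").
     option("hoodie.table.name", "hudi_tbl").
     option("hoodie.clustering.inline", "true").
     option("hoodie.clustering.inline.max.commits", "4").
     // at most 30 parallel clustering groups per plan...
     option("hoodie.clustering.plan.strategy.max.num.groups", "30").
     // ...each packing at most 2 GB of input...
     option("hoodie.clustering.plan.strategy.max.bytes.per.group", (2L * 1024 * 1024 * 1024).toString).
     // ...rewritten into output files of up to ~1 GB
     option("hoodie.clustering.plan.strategy.target.file.max.bytes", (1024L * 1024 * 1024).toString).
     mode("append").
     save("/tmp/hudi_tbl")
   ```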
   





[GitHub] [hudi] nsivabalan commented on issue #8085: [SUPPORT] deltacommit triggering criteria

2023-03-03 Thread via GitHub


nsivabalan commented on issue #8085:
URL: https://github.com/apache/hudi/issues/8085#issuecomment-1453904397

   Hey @tatiana-rackspace:
   Deltastreamer, as you might know, is a streaming ingestion tool;
   there is a source limit on how much to consume in each batch.
   In the case of Kafka, it's the number of messages; in the case of DFS-based
sources, it's the number of bytes.
   
   You can configure the source limit using `--source-limit`. More info can be
found here: https://hudi.apache.org/docs/hoodie_deltastreamer
   
   Also, it depends on how much data was available when sync() was called.
   Let's say you have configured the min sync interval to 30
minutes (`--min-sync-interval-seconds`): Deltastreamer will try to fetch data
from the source and sync to Hudi once every 30 minutes.
   So at t0 it will consume from the source, adhering to the max limit you have
configured, and then after 30 minutes it will again consume from the source,
starting from the last checkpoint and again adhering to the source limit.
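   
   For illustration, a hedged sketch of the relevant flags on a Deltastreamer
launch; the jar, source class, paths, and values are placeholders:
   
   ```
   spark-submit \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     hudi-utilities-bundle.jar \
     --table-type MERGE_ON_READ \
     --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
     --source-ordering-field ts \
     --target-base-path s3://bucket/hudi_tbl \
     --target-table hudi_tbl \
     --source-limit 5000000 \
     --min-sync-interval-seconds 1800 \
     --continuous
   ```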
   
   Let me know if this clarifies things. 
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453798833

   
   ## CI report:
   
   * 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)
 
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453744751

   
   ## CI report:
   
   * 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)
 
   * a322500ff1b38637a5efefb58d75ea83bb0dab84 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1453744033

   
   ## CI report:
   
   * e5ed02b3c18025fc3b0c5a135be64991fb43417b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15494)
 
   * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] yuzhaojing commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server

2023-03-03 Thread via GitHub


yuzhaojing commented on PR #8080:
URL: https://github.com/apache/hudi/pull/8080#issuecomment-1453735396

   > @yuzhaojing @xushiyan The changes to the write client are done when 
introducing the new table service client. Before that, based on my 
understanding, the inline table services running along with the regular write 
client share the same timeline server. So I think with the new table service 
client, we should still follow the same convention. Is there anything I miss? 
When the table service manager is used, how's the interplay between the 
timeline server and the table service manager?
   > 
   > cc @nsivabalan
   > 
   > Before we fully agree on the approach here, let's not merge this PR. Also, 
I'd like to add some tests to guard around the expected behavior, after the 
discussion.
   
   @yihua @danny0405 @xushiyan I'm sorry for this serious bug. I think the 
table service client should share the same timeline server as the regular write 
client. Here I think the following tests can be added to the table service 
client: 
   
   1. Add unit tests to confirm that the table service client has not made 
unexpected modifications to writeConfig.
   2. Confirm that the table service of the table service client is scheduled 
and executed normally.
   3. Confirm that calls work correctly after the table service manager is started.
   
   I want to hear your thoughts, and I apologize again!





[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


hudi-bot commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453733568

   
   ## CI report:
   
   * 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df

2023-03-03 Thread via GitHub


hudi-bot commented on PR #7951:
URL: https://github.com/apache/hudi/pull/7951#issuecomment-1453732980

   
   ## CI report:
   
   * e5ed02b3c18025fc3b0c5a135be64991fb43417b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15494)
 
   * bbf05d39a470149af7259e2ea0a69b76ebb660df UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink

2023-03-03 Thread via GitHub


danny0405 commented on PR #8070:
URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453728899

   @hudi-bot run azure





[hudi] branch master updated: [HUDI-5847] Add support for multiple metric reporters and metric labels (#8041)

2023-03-03 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 81e6e854883 [HUDI-5847] Add support for multiple metric reporters and 
metric labels (#8041)
81e6e854883 is described below

commit 81e6e854883a94d41ae5b7187c608a8ddbc7bf35
Author: Lokesh Jain 
AuthorDate: Fri Mar 3 21:06:43 2023 +0530

[HUDI-5847] Add support for multiple metric reporters and metric labels 
(#8041)

Add support for multiple metric reporters within a MetricRegistry.
It also adds labels to metrics.
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  4 ++
 .../hudi/config/metrics/HoodieMetricsConfig.java   |  6 ++
 .../java/org/apache/hudi/metrics/MetricUtils.java  | 81 ++
 .../main/java/org/apache/hudi/metrics/Metrics.java | 54 ---
 .../hudi/metrics/MetricsReporterFactory.java   | 14 +++-
 .../hudi/metrics/datadog/DatadogHttpClient.java| 20 --
 .../metrics/datadog/DatadogMetricsReporter.java|  2 +-
 .../hudi/metrics/datadog/DatadogReporter.java  | 27 +---
 .../prometheus/PushGatewayMetricsReporter.java | 26 ++-
 .../metrics/prometheus/PushGatewayReporter.java| 42 ++-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  4 +-
 .../prometheus/TestPushGateWayReporter.java| 74 +++-
 .../src/test/resources/datadog.properties  | 25 +++
 .../src/test/resources/prometheus.properties   | 24 +++
 14 files changed, 333 insertions(+), 70 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 7ce7d8c6574..886112cae16 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -2123,6 +2123,10 @@ public class HoodieWriteConfig extends HoodieConfig {
 return getStringOrDefault(HoodieMetricsConfig.METRICS_REPORTER_PREFIX);
   }
 
+  public String getMetricReporterFileBasedConfigs() {
+return 
getStringOrDefault(HoodieMetricsConfig.METRICS_REPORTER_FILE_BASED_CONFIGS_PATH);
+  }
+
   /**
* memory configs.
*/
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
index 486f1277ba7..b7f3fa1f630 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
@@ -95,6 +95,12 @@ public class HoodieMetricsConfig extends HoodieConfig {
   .sinceVersion("0.13.0")
   .withDocumentation("Enable metrics for locking infra. Useful when 
operating in multiwriter mode");
 
+  public static final ConfigProperty<String> METRICS_REPORTER_FILE_BASED_CONFIGS_PATH = ConfigProperty
+  .key(METRIC_PREFIX + ".configs.properties")
+  .defaultValue("")
+  .sinceVersion("0.14.0")
+  .withDocumentation("Comma separated list of config file paths for metric 
exporter configs");
+
   /**
* @deprecated Use {@link #TURN_METRICS_ON} and its methods instead
*/
diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java
new file mode 100644
index 000..e119760883f
--- /dev/null
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import java.util.Arrays;
+import java.util.List;
+import 
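
For reference, a minimal usage sketch, not from the patch itself: metrics are turned on and the new option (whose key resolves to `hoodie.metrics.configs.properties`, i.e. METRIC_PREFIX + ".configs.properties" per the HoodieMetricsConfig hunk above) is pointed at per-reporter property files. The file paths, table name, and base path below are hypothetical placeholders.

import java.util.Properties

import org.apache.hudi.config.HoodieWriteConfig

object MultiReporterConfigSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Turn metrics on (HoodieMetricsConfig.TURN_METRICS_ON).
    props.setProperty("hoodie.metrics.on", "true")
    // Comma separated list of per-reporter config files, matching the
    // withDocumentation() text above; these paths are hypothetical.
    props.setProperty("hoodie.metrics.configs.properties",
      "/etc/hudi/datadog.properties,/etc/hudi/prometheus.properties")

    val writeConfig = HoodieWriteConfig.newBuilder()
      .withPath("/tmp/hudi_metrics_demo") // hypothetical base path
      .forTable("metrics_demo")           // hypothetical table name
      .withProperties(props)
      .build()

    // Getter added by this commit (see the HoodieWriteConfig hunk above).
    println(writeConfig.getMetricReporterFileBasedConfigs)
  }
}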

[GitHub] [hudi] codope merged pull request #8041: [HUDI-5847] Add support for multiple metric reporters and metric labels

2023-03-03 Thread via GitHub


codope merged PR #8041:
URL: https://github.com/apache/hudi/pull/8041





[GitHub] [hudi] nsivabalan commented on pull request #8041: [HUDI-5847] Add support for multiple metric reporters and metric labels

2023-03-03 Thread via GitHub


nsivabalan commented on PR #8041:
URL: https://github.com/apache/hudi/pull/8041#issuecomment-1453706032

   CI is green
   Screenshot: https://user-images.githubusercontent.com/513218/222760967-36a1b0e9-fb75-46cd-8e15-7ed373cc5b32.png
   





[hudi] branch master updated: [HUDI-5665] Adding support to re-use table configs (#7901)

2023-03-03 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new cfe490fcb23 [HUDI-5665] Adding support to re-use table configs (#7901)
cfe490fcb23 is described below

commit cfe490fcb2333049b4f47a2d1d241b07e12d42c1
Author: Sivabalan Narayanan 
AuthorDate: Fri Mar 3 07:09:03 2023 -0800

[HUDI-5665] Adding support to re-use table configs (#7901)

- As of now, we expect users to set some of the mandatory fields on every 
write, e.g. record keys, partition path, etc. These cannot change for a given 
table and get serialized into the table config. In this patch, we add support 
to re-use table configs, so users can set these configs only in the first 
commit for a given table; subsequent writes re-use them from the table config 
when not explicitly set by the user.
---
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  29 +++-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  44 --
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  22 ++-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |  34 ++---
 .../apache/hudi/functional/TestCOWDataSource.scala | 164 +
 .../hudi/functional/TestStreamingSource.scala  |   4 +
 6 files changed, 260 insertions(+), 37 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index d2c8629df98..1e3c219b6c6 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -25,7 +25,7 @@ import org.apache.hudi.common.model.{HoodieTableType, 
WriteOperationType}
 import org.apache.hudi.common.table.HoodieTableConfig
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.common.util.{Option, StringUtils}
-import org.apache.hudi.config.{HoodieClusteringConfig, HoodieWriteConfig}
+import org.apache.hudi.config.{HoodieClusteringConfig, HoodiePayloadConfig, 
HoodieWriteConfig}
 import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncConfigHolder, 
HiveSyncTool}
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions
 import org.apache.hudi.keygen.{ComplexKeyGenerator, CustomKeyGenerator, 
NonpartitionedKeyGenerator, SimpleKeyGenerator}
@@ -830,6 +830,33 @@ object DataSourceOptionsHelper {
 translatedOpt.toMap
   }
 
+  /**
+   * Some config keys differ from what user sets and whats part of table 
Config. this method assists in fetching the
+   * right table config and populating write configs.
+   * @param tableConfig table config of interest.
+   * @param params incoming write params.
+   * @return missing params that needs to be added to incoming write params
+   */
+  def fetchMissingWriteConfigsFromTableConfig(tableConfig: HoodieTableConfig, 
params: Map[String, String]) : Map[String, String] = {
+val missingWriteConfigs = scala.collection.mutable.Map[String, String]()
+if (!params.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) && 
tableConfig.getRecordKeyFieldProp != null) {
+  missingWriteConfigs ++= 
Map(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key() -> 
tableConfig.getRecordKeyFieldProp)
+}
+if (!params.contains(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key()) 
&& tableConfig.getPartitionFieldProp != null) {
+  missingWriteConfigs ++= 
Map(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key() -> 
tableConfig.getPartitionFieldProp)
+}
+if (!params.contains(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key()) 
&& tableConfig.getKeyGeneratorClassName != null) {
+  missingWriteConfigs ++= 
Map(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> 
tableConfig.getKeyGeneratorClassName)
+}
+if (!params.contains(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key()) && 
tableConfig.getPreCombineField != null) {
+  missingWriteConfigs ++= Map(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key 
-> tableConfig.getPreCombineField)
+}
+if (!params.contains(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key()) && 
tableConfig.getPayloadClass != null) {
+  missingWriteConfigs ++= 
Map(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key() -> 
tableConfig.getPayloadClass)
+}
+missingWriteConfigs.toMap
+  }
+
   def parametersWithReadDefaults(parameters: Map[String, String]): Map[String, 
String] = {
 // First check if the ConfigUtils.IS_QUERY_AS_RO_TABLE has set by 
HiveSyncTool,
 // or else use query type from QUERY_TYPE.
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 

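To illustrate the user-facing effect described in the commit message, here is a minimal Spark datasource sketch, not part of the patch; the table name, base path, and column names are hypothetical. The first write sets the mandatory key configs, which get persisted to the table config; the second write omits them and relies on this change to re-use them.

import org.apache.spark.sql.{SaveMode, SparkSession}

object ReuseTableConfigsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reuse-table-configs-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val basePath = "/tmp/hudi_reuse_demo" // hypothetical

    // First commit: key configs are set explicitly and serialized into
    // the table config (hoodie.properties).
    val df = Seq((1, "a", "2023-03-03", 1L)).toDF("id", "name", "dt", "ts")
    df.write.format("hudi")
      .option("hoodie.table.name", "reuse_demo")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .mode(SaveMode.Overwrite)
      .save(basePath)

    // Subsequent write: record key, partition path, key generator and
    // precombine field are omitted and re-used from the table config.
    val df2 = Seq((2, "b", "2023-03-03", 2L)).toDF("id", "name", "dt", "ts")
    df2.write.format("hudi")
      .option("hoodie.table.name", "reuse_demo")
      .mode(SaveMode.Append)
      .save(basePath)

    spark.stop()
  }
}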