[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454658647

## CI report:

* a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
* 10d3659dcc94bc069d0da83ee3b711bf4ff079fe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15573)

## Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125415237

## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:

```diff
@@ -181,22 +181,24 @@
   /**
    * Syncs data-set view if local view is behind.
    */
   private boolean syncIfLocalViewBehind(Context ctx) {
-    if (isLocalViewBehind(ctx)) {
-      String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-      String lastKnownInstantFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-      SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
-      synchronized (view) {
-        if (isLocalViewBehind(ctx)) {
-          HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
-          LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
-              + " as last known instant but server has the following last instant on timeline :"
-              + localTimeline.lastInstant());
-          view.sync();
-          return true;
-        }
+    boolean result = false;
+    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
+    SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
+    synchronized (view) {
+      if (isLocalViewBehind(ctx)) {
+        String lastKnownInstantFromClient = ctx.queryParamAsClass(
+            RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class)
+            .getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
+        HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
+        LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
+            + " as last known instant but server has the following last instant on timeline :"
+            + localTimeline.lastInstant());
+        view.sync();
+        result = true;
```

Review Comment: Good catch! The variable is not needed. Fixed now.
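For readers following this thread, the pattern under discussion is double-checked locking around the view sync: a cheap unlocked check, then a re-check inside the lock before syncing. The sketch below is generic and illustrative; the class and method names are not Hudi's actual API.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Generic sketch of the check / lock / re-check / sync pattern from the
 * review above. Names are hypothetical, not Hudi classes.
 */
public class DoubleCheckedSync {
  // Stand-in for the server's last known timeline instant.
  private final AtomicLong serverInstant = new AtomicLong(0);

  // Stand-in for isLocalViewBehind(ctx): true if the client has seen a
  // newer instant than the server's cached view.
  boolean isBehind(long clientInstant) {
    return serverInstant.get() < clientInstant;
  }

  /**
   * Returns true only if this call performed the sync. The re-check inside
   * the synchronized block ensures that of N concurrent callers, only the
   * first pays for the sync; the rest observe an up-to-date view.
   */
  public boolean syncIfBehind(long clientInstant) {
    if (!isBehind(clientInstant)) {    // cheap first check, no lock held
      return false;
    }
    synchronized (this) {
      if (isBehind(clientInstant)) {   // re-check under the lock
        serverInstant.set(clientInstant); // stand-in for view.sync()
        return true;                   // returning directly here is safe
      }
      return false;
    }
  }

  public static void main(String[] args) {
    DoubleCheckedSync s = new DoubleCheckedSync();
    System.out.println(s.syncIfBehind(5)); // first caller syncs: true
    System.out.println(s.syncIfBehind(5)); // already synced: false
  }
}
```

Returning directly from inside the `synchronized` block (as danny0405 suggests) releases the monitor on exit just like falling through would, so the flag variable adds nothing.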
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454657002

## CI report:

* a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
* 10d3659dcc94bc069d0da83ee3b711bf4ff079fe UNKNOWN
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125388754

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

```diff
@@ -282,6 +286,20 @@ public void reset() {
     }
   }
 
+  /**
+   * Resets the view states, which can be overridden by subclasses. This reset logic is guarded
+   * by the write lock.
+   *
+   * NOTE: This method SHOULD BE OVERRIDDEN for any custom logic. DO NOT OVERRIDE
+   * {@link AbstractTableFileSystemView#reset} directly, which may cause stale file system view
+   * to be served.
+   */
+  protected void runReset() {
```

Review Comment: Removed the `runSync` and `runReset` methods to avoid confusion and to make every implementation use the write lock explicitly, except the remote FSV. If a new file system view needs to be added, the author should look at an existing implementation for reference. Renaming won't prevent the author from doing the wrong thing.
[GitHub] [hudi] danny0405 commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
danny0405 commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125407040

## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:

```diff
@@ -181,22 +181,24 @@
   /**
    * Syncs data-set view if local view is behind.
    */
   private boolean syncIfLocalViewBehind(Context ctx) {
-    if (isLocalViewBehind(ctx)) {
-      String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-      String lastKnownInstantFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-      SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
-      synchronized (view) {
-        if (isLocalViewBehind(ctx)) {
-          HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
-          LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
-              + " as last known instant but server has the following last instant on timeline :"
-              + localTimeline.lastInstant());
-          view.sync();
-          return true;
-        }
+    boolean result = false;
+    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
+    SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
+    synchronized (view) {
+      if (isLocalViewBehind(ctx)) {
+        String lastKnownInstantFromClient = ctx.queryParamAsClass(
+            RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class)
+            .getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
+        HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
+        LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
+            + " as last known instant but server has the following last instant on timeline :"
+            + localTimeline.lastInstant());
+        view.sync();
+        result = true;
```

Review Comment: Can we return directly from this line?
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454572272

## CI report:

* 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
* c162956f9f418b4603328c37f9e2babf59613d4b Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
* 7fff406e74cdf3faf047634a2d596399fa49f059 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15572)
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454564278

## CI report:

* 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
* 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
* c162956f9f418b4603328c37f9e2babf59613d4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
* 7fff406e74cdf3faf047634a2d596399fa49f059 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454564173

## CI report:

* a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390671

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

```diff
@@ -269,19 +269,35 @@ public void close() {
   /**
    * Clears the partition Map and reset view states.
+   *
+   * NOTE: This method SHOULD NOT BE OVERRIDDEN which may cause stale file system view
+   * to be served. Instead, override {@link AbstractTableFileSystemView#runReset} to
+   * add custom logic.
    */
   @Override
   public void reset() {
     try {
       writeLock.lock();
-      clear();
-      // Initialize with new Hoodie timeline.
-      init(metaClient, getTimeline());
+      runReset();
     } finally {
       writeLock.unlock();
     }
   }
 
+  /**
+   * Resets the view states, which can be overridden by subclasses. This reset logic is guarded
+   * by the write lock.
+   *
+   * NOTE: This method SHOULD BE OVERRIDDEN for any custom logic. DO NOT OVERRIDE
```

Review Comment: No longer needed.
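The hunk above hinges on one invariant: all reset logic must run while the write lock is held, so readers never observe a half-reset view. A generic sketch of that guard, using hypothetical names rather than Hudi's actual classes:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Generic sketch of a write-lock-guarded reset. Readers take the read
 * lock; reset takes the write lock, so a reset is atomic with respect
 * to all reads. Names are illustrative, not Hudi's API.
 */
public class LockedView {
  private final ReentrantReadWriteLock globalLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock.ReadLock readLock = globalLock.readLock();
  private final ReentrantReadWriteLock.WriteLock writeLock = globalLock.writeLock();
  private long version = 0;

  /** Readers hold the read lock, so they never see a half-reset view. */
  public long read() {
    readLock.lock();
    try {
      return version;
    } finally {
      readLock.unlock();
    }
  }

  /** All reset logic runs inline under the write lock; no overridable hook. */
  public void reset() {
    writeLock.lock();
    try {
      version++; // stand-in for clear() + init(metaClient, getTimeline())
    } finally {
      writeLock.unlock();
    }
  }

  public static void main(String[] args) {
    LockedView v = new LockedView();
    v.reset();
    System.out.println(v.read()); // prints 1
  }
}
```

Keeping the lock acquisition in the base `reset()` (rather than in an overridable `runReset()`) is exactly what avoids the stale-view hazard the review discusses: a subclass cannot accidentally reset outside the lock.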
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390520

## hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:

```diff
@@ -269,19 +269,35 @@ public void close() {
   /**
    * Clears the partition Map and reset view states.
+   *
+   * NOTE: This method SHOULD NOT BE OVERRIDDEN which may cause stale file system view
+   * to be served. Instead, override {@link AbstractTableFileSystemView#runReset} to
+   * add custom logic.
    */
   @Override
   public void reset() {
```

Review Comment: No longer needed as discussed. We directly use the write lock in each overriding implementation instead of the indirect usage.

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java:

@@ -90,14 +90,14 @@ protected Map, FileStatus[]> listPartitions(List
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125390087

## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:

```diff
@@ -78,6 +79,10 @@ public class RequestHandler {
   private final BaseFileHandler dataFileHandler;
   private final MarkerHandler markerHandler;
   private final Registry metricsRegistry = Registry.getRegistry("TimelineService");
+  // This read-write lock is used for syncing the file system view if it is behind client's view
```

Review Comment: This read-write lock is removed now.
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125389847

## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:

```diff
@@ -151,30 +156,38 @@ public void stop() {
   /**
    * Determines if local view of table's timeline is behind that of client's view.
    */
   private boolean isLocalViewBehind(Context ctx) {
-    String basePath = ctx.queryParam(RemoteHoodieTableFileSystemView.BASEPATH_PARAM);
-    String lastKnownInstantFromClient =
-        ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.LAST_INSTANT_TS, String.class).getOrDefault(HoodieTimeline.INVALID_INSTANT_TS);
-    String timelineHashFromClient = ctx.queryParamAsClass(RemoteHoodieTableFileSystemView.TIMELINE_HASH, String.class).getOrDefault("");
-    HoodieTimeline localTimeline =
-        viewManager.getFileSystemView(basePath).getTimeline().filterCompletedOrMajorOrMinorCompactionInstants();
-    if (LOG.isDebugEnabled()) {
-      LOG.debug("Client [ LastTs=" + lastKnownInstantFromClient + ", TimelineHash=" + timelineHashFromClient
-          + "], localTimeline=" + localTimeline.getInstants());
-    }
+    try {
+      // This read lock makes sure that if the local view of the table is being synced,
+      // no timeline server requests should be processed or handled until the sync process
```

Review Comment: @danny0405 This is simplified now. You can also check my updated PR description for how the race condition can happen.
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454556422

## CI report:

* 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
* 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
* c162956f9f418b4603328c37f9e2babf59613d4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15571)
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454556333

## CI report:

* a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
* e4c3c1dac4ae60c71219183167b491379f181ab0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15567)
[GitHub] [hudi] yihua commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454555052

> Is this a regression? what version? Can I look at the offending commit to understand how it was before.

I updated the PR description to provide more detailed information.
[GitHub] [hudi] yihua commented on a diff in pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
yihua commented on code in PR #8079: URL: https://github.com/apache/hudi/pull/8079#discussion_r1125389071

## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java:

```diff
@@ -187,12 +200,20 @@ private boolean syncIfLocalViewBehind(Context ctx) {
   SyncableFileSystemView view = viewManager.getFileSystemView(basePath);
   synchronized (view) {
     if (isLocalViewBehind(ctx)) {
-      HoodieTimeline localTimeline = viewManager.getFileSystemView(basePath).getTimeline();
-      LOG.info("Syncing view as client passed last known instant " + lastKnownInstantFromClient
-          + " as last known instant but server has the following last instant on timeline :"
-          + localTimeline.lastInstant());
-      view.sync();
-      return true;
+      try {
```

Review Comment: As synced offline, only keeping the synchronized block now.
[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
danny0405 commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454547592

@hudi-bot run travis
[GitHub] [hudi] voonhous commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?
voonhous commented on issue #8071: URL: https://github.com/apache/hudi/issues/8071#issuecomment-1454525599

> Sorry for late reply, did you already use the append and it is still slow?

Yeap, judging from the stack trace, he is running his job under append-only mode.

```log
org.apache.hudi.sink.append.AppendWriteFunction.initWriterHelper(AppendWriteFunction.java:110
```

> Then we switched to the snappy format, and the writing performance did improve to a certain extent. However, due to the Tencent Cloud COS we used for storage, there was a list frequency control problem in cow writing, so the overall performance could not be greatly improved, and the exception is as follows:

This feels like a COS issue. @DavidZ1 you mentioned `there was a list frequency control problem in cow writing`. So, it's spending too much time listing files? IIUC, your job might be writing too many parquet files while flushing?

I am not very familiar with COS, so I am taking a shot in the dark here. Looking at your configurations, the default `write.parquet.max.file.size` is used, which is 120MB by default. Perhaps you could try increasing this so that fewer parquet files are written? Do note that your parquet files will get larger.
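The tuning suggested above can be expressed in the Flink SQL DDL for the sink table. The table name, columns, and path below are hypothetical placeholders; `write.parquet.max.file.size` is the option being discussed (value in MB), and the raised value of 240 is only an example, not a recommendation:

```sql
CREATE TABLE t_cos_sink (          -- hypothetical table definition
  id STRING PRIMARY KEY NOT ENFORCED,
  msg STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'cosn://bucket/path',              -- placeholder COS path
  'write.operation' = 'insert',               -- append-only mode, as in the stack trace
  'write.parquet.max.file.size' = '240'       -- MB; raised from the 120 default
);
```

Larger target file sizes mean fewer files to list per flush, which is the lever voonhous is pointing at for the COS list-frequency throttling.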
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454506895

## CI report:

* 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
* 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
* c162956f9f418b4603328c37f9e2babf59613d4b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454499436

## CI report:

* 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN
* 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table
hudi-bot commented on PR #8076: URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454490473

## CI report:

* 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
* 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
* f384bbc843028360687903b3b6de835685235b68 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15570)
[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table
hudi-bot commented on PR #8076: URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454442192

## CI report:

* 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
* 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
* f384bbc843028360687903b3b6de835685235b68 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8076: Support bulk_insert for insert_overwrite and insert_overwrite_table
hudi-bot commented on PR #8076: URL: https://github.com/apache/hudi/pull/8076#issuecomment-1454433889

## CI report:

* 8432800aa63cc5e4d4384f2ade7747aff96bc1c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15506)
* 6a239ada8998fd440f19c0082b26d206ed589870 UNKNOWN
[hudi] branch master updated (18d528f33d8 -> d40a6211f64)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from 18d528f33d8 [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)
add  d40a6211f64 [HUDI-5796] Adding auto inferring partition from incoming df (#7951)

No new revisions were added by this update.

Summary of changes:
 .../testsuite/dag/nodes/SparkDeleteNode.scala      |   2 +-
 .../dag/nodes/SparkDeletePartitionNode.scala       |   2 +-
 .../testsuite/dag/nodes/SparkInsertNode.scala      |   2 +-
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  45
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |   2 +-
 .../apache/hudi/functional/TestCOWDataSource.scala | 117 +++--
 6 files changed, 136 insertions(+), 34 deletions(-)
[GitHub] [hudi] nsivabalan merged pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
nsivabalan merged PR #7951: URL: https://github.com/apache/hudi/pull/7951
[GitHub] [hudi] raghavant-git commented on issue #8016: Inline Clustering : Clustering failed to write to files
raghavant-git commented on issue #8016: URL: https://github.com/apache/hudi/issues/8016#issuecomment-1454402165

Thanks for the response. Will test the above parameters and update it here.
[GitHub] [hudi] vinothchandar commented on a diff in pull request #7907: [HUDI-5672][RFC-61] Lockless multi writer support
vinothchandar commented on code in PR #7907: URL: https://github.com/apache/hudi/pull/7907#discussion_r1122538301 ## rfc/rfc-61/rfc-61.md: ## @@ -0,0 +1,98 @@ +# RFC-61: Lockless Multi Writer + +## Proposers +- @danny0405 +- @ForwardXu +- @SteNicholas + +## Approvers +- + +## Status + +JIRA: [Lockless multi writer support](https://issues.apache.org/jira/browse/HUDI-5672) + +## Abstract +Hudi already supports basic OCC with abundant lock providers. +But for multiple streaming ingestion writers, OCC does not work well because conflicts happen at very high frequency. +To expand on that: with the hashing index, all writers use the same deterministic hashing algorithm to distribute records by primary key, +so the keys are spread evenly across all the data buckets. For a single data flush in one writer, almost every data bucket is appended with new input, +so conflicts are very likely for multi-writer because almost all the data buckets are being written by multiple writers at the same time. +For the bloom filter index things are different, but remember that we have a small-file load-rebalance strategy that writes into the **small** buckets with higher priority; +that means multiple writers are prone to write into the same **small** buckets at the same time, and that is how conflicts happen. + +In general, for multi-writer streaming ingestion, explicit locking is not really fit for production; in this RFC we propose a lockless solution for streaming ingestion. + +## Background + +Streaming jobs are naturally suitable for data ingestion: they have none of the complexity of pipeline orchestration and have a smoother write workload. +Most of the raw data sets we handle today are generated continuously in a streaming fashion. + +Based on that, many requests for multi-writer ingestion have been raised. +With multi-writer ingestion, several streams of events with the same schema can be drained into one Hudi table, +so the Hudi table effectively becomes a UNION table view over all the input data sets. This is a very common use case because, in reality, data sets are usually scattered across data sources. + +Another very useful use case we want to unlock is real-time data set joins. One of the biggest pain points in streaming computation is the data set join: +an engine like Flink has basic support for all kinds of SQL JOINs, but it stores the input records in its internal state backend, which is a huge cost for a pure data join with no additional computation. +In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer, +we can implement N-way data-source joins in real time! Hudi takes care of the payload join during the compaction service procedure. + +## Design + +### The Precondition + + MOR Table Type Is Required + +The table type must be `MERGE_ON_READ`, so that we can defer conflict resolution to the compaction phase. The compaction service resolves conflicts for the same keys by respecting the event-time sequence of the events. Review Comment: compaction or merge on read for queries.
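The bucket-overlap argument in the Abstract can be made concrete with a minimal sketch. This is illustrative only, not Hudi's actual bucket-index code; the class, constant, and method names are assumptions. It shows that because every writer maps keys to the same fixed set of buckets, two concurrent writers ingesting even completely disjoint key ranges still touch overlapping buckets on almost every flush:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (not Hudi's bucket-index implementation): deterministic
// key hashing means all writers share one bucket layout, so concurrent flushes
// almost always collide on buckets even when their key ranges are disjoint.
public class BucketOverlapSketch {
    static final int NUM_BUCKETS = 8;

    // Deterministic bucket assignment shared by every writer.
    static int bucketFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    static Set<Integer> bucketsTouched(List<String> keys) {
        Set<Integer> buckets = new HashSet<>();
        for (String k : keys) {
            buckets.add(bucketFor(k));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Two writers ingesting disjoint key ranges.
        List<String> writerA = new ArrayList<>();
        List<String> writerB = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            writerA.add("user-" + i);
            writerB.add("order-" + i);
        }
        Set<Integer> overlap = bucketsTouched(writerA);
        overlap.retainAll(bucketsTouched(writerB));
        // With 100 records per flush and only 8 buckets, the overlap is
        // essentially the whole bucket set — exactly the OCC conflict scenario.
        System.out.println("overlapping buckets: " + overlap.size());
    }
}
```

With realistic bucket counts the picture is the same: per-flush record counts dwarf the bucket count, so pairwise writer conflicts are near-certain, which is why the RFC defers resolution to compaction instead of using OCC.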
[GitHub] [hudi] danny0405 commented on issue #8018: [SUPPORT] why is the schema evolution done while not setting hoodie.schema.on.read.enable
danny0405 commented on issue #8018: URL: https://github.com/apache/hudi/issues/8018#issuecomment-1454372483 I guess we need a clear doc to elaborate the schema evolution details for 0.13.0
[jira] [Closed] (HUDI-5736) De-coupling column drop flag and schema validation flag in Flink
[ https://issues.apache.org/jira/browse/HUDI-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5736. Resolution: Fixed Fixed via master branch: 18d528f33d8b1dd7a836e5543ddf36e0a9c95ad1 > De-coupling column drop flag and schema validation flag in Flink > > > Key: HUDI-5736 > URL: https://issues.apache.org/jira/browse/HUDI-5736 > Project: Apache Hudi > Issue Type: Bug > Components: flink, writer-core >Reporter: Alexander Trushev >Assignee: Alexander Trushev >Priority: Major > Labels: pull-request-available > > Fix https://issues.apache.org/jira/browse/HUDI-5704 for Flink engine -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 18d528f33d8 [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895) 18d528f33d8 is described below commit 18d528f33d8b1dd7a836e5543ddf36e0a9c95ad1 Author: Alexander Trushev AuthorDate: Sat Mar 4 11:00:25 2023 +0700 [HUDI-5736] Common de-coupling column drop flag and schema validation flag (#7895) * [HUDI-5736] Common de-coupling column drop flag and schema validation flag --- .../java/org/apache/hudi/table/HoodieTable.java| 40 ++--- .../hudi/table/TestHoodieMergeOnReadTable.java | 1 + .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 65 ++ .../org/apache/hudi/avro/TestAvroSchemaUtils.java | 57 +++ .../apache/hudi/sink/ITTestDataStreamWrite.java| 52 + .../resources/test_read_schema_dropped_age.avsc| 41 ++ .../org/apache/hudi/HoodieSparkSqlWriter.scala | 4 +- .../AlterHoodieTableChangeColumnCommand.scala | 2 +- .../hudi/command/MergeIntoHoodieTableCommand.scala | 3 +- 9 files changed, 242 insertions(+), 23 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java index 2a71cf4ea46..8b1056bca6c 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java @@ -18,6 +18,7 @@ package org.apache.hudi.table; +import org.apache.hudi.avro.AvroSchemaUtils; import org.apache.hudi.avro.HoodieAvroUtils; import org.apache.hudi.avro.model.HoodieCleanMetadata; import org.apache.hudi.avro.model.HoodieCleanerPlan; @@ -92,6 +93,9 @@ import org.apache.log4j.Logger; import java.io.IOException; import java.io.Serializable; import java.util.ArrayList; +import 
java.util.Arrays; +import java.util.Collections; +import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; @@ -100,7 +104,6 @@ import java.util.function.Function; import java.util.stream.Collectors; import java.util.stream.Stream; -import static org.apache.hudi.avro.AvroSchemaUtils.isSchemaCompatible; import static org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy.EAGER; import static org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy.LAZY; import static org.apache.hudi.common.table.HoodieTableConfig.TABLE_METADATA_PARTITIONS; @@ -803,27 +806,22 @@ public abstract class HoodieTable implements Serializable { */ private void validateSchema() throws HoodieUpsertException, HoodieInsertException { -if (!shouldValidateAvroSchema() || getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) { +boolean shouldValidate = config.shouldValidateAvroSchema(); +boolean allowProjection = config.shouldAllowAutoEvolutionColumnDrop(); +if ((!shouldValidate && allowProjection) +|| getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) { // Check not required return; } -Schema tableSchema; -Schema writerSchema; -boolean isValid; try { TableSchemaResolver schemaResolver = new TableSchemaResolver(getMetaClient()); - writerSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema()); - tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaResolver.getTableAvroSchema(false)); - isValid = isSchemaCompatible(tableSchema, writerSchema, config.shouldAllowAutoEvolutionColumnDrop()); + Schema writerSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema()); + Schema tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaResolver.getTableAvroSchema(false)); + AvroSchemaUtils.checkSchemaCompatible(tableSchema, writerSchema, shouldValidate, allowProjection, getDropPartitionColNames()); } catch (Exception e) { throw new HoodieException("Failed to read schema/check 
compatibility for base path " + metaClient.getBasePath(), e); } - -if (!isValid) { - throw new HoodieException("Failed schema compatibility check for writerSchema :" + writerSchema - + ", table schema :" + tableSchema + ", base path :" + metaClient.getBasePath()); -} } public void validateUpsertSchema() throws HoodieUpsertException { @@ -1041,11 +1039,15 @@ public abstract class HoodieTable implements Serializable { return Functions.noop(); } - private boolean shouldValidateAvroSchema() { -// TODO(HUDI-4772) re-enable validations in case partition columns -// being dropped from the data-file after fixing the write schema -
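The patch above replaces the single validation flag with two decoupled flags routed into `AvroSchemaUtils.checkSchemaCompatible`. The gate that decides whether any check is needed reduces to a small predicate; the sketch below is illustrative (the class and method names are hypothetical, not Hudi's actual API), mirroring the `(!shouldValidate && allowProjection)` short-circuit in `validateSchema()`:

```java
// Hypothetical sketch of the decoupled-flag gate from the patch above;
// class/method names are illustrative, not Hudi's actual API.
public class SchemaCheckGate {
    /**
     * A schema check is required unless BOTH explicit Avro schema validation
     * is disabled AND dropping columns (auto-evolution projection) is allowed.
     * If validation is off but column drops are forbidden, the check must
     * still run to catch an accidental column drop.
     */
    static boolean checkRequired(boolean shouldValidate, boolean allowProjection) {
        return !(!shouldValidate && allowProjection);
    }

    public static void main(String[] args) {
        System.out.println(checkRequired(false, true));  // false: nothing to enforce
        System.out.println(checkRequired(false, false)); // true: must catch column drops
        System.out.println(checkRequired(true,  true));  // true: validation requested
    }
}
```

The middle case is the point of the decoupling: before this change, turning validation off also silently permitted column drops, which the two independent flags now prevent.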
[GitHub] [hudi] danny0405 merged pull request #7895: [HUDI-5736] Common de-coupling column drop flag and schema validation flag
danny0405 merged PR #7895: URL: https://github.com/apache/hudi/pull/7895
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454366553 ## CI report: * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566) * e4c3c1dac4ae60c71219183167b491379f181ab0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15567) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454364773 ## CI report: * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566) * e4c3c1dac4ae60c71219183167b491379f181ab0 UNKNOWN
[GitHub] [hudi] danny0405 commented on pull request #7687: [HUDI-5606] Update to handle deletes in postgres debezium
danny0405 commented on PR #7687: URL: https://github.com/apache/hudi/pull/7687#issuecomment-1454364001 Reviewing now, can you add some test cases for the payload thing?
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454363083 ## CI report: * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15561)
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454363052 ## CI report: * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15566)
[GitHub] [hudi] danny0405 commented on issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc
danny0405 commented on issue #8092: URL: https://github.com/apache/hudi/issues/8092#issuecomment-1454362530 Thanks, can you file a PR against the asf-site branch to fix that?
[GitHub] [hudi] danny0405 closed issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc
danny0405 closed issue #8092: [SUPPORT] Spell Mistake on Hudi Configurations Doc URL: https://github.com/apache/hudi/issues/8092
[GitHub] [hudi] danny0405 commented on issue #8071: [SUPPORT]How to improve the speed of Flink writing to hudi ?
danny0405 commented on issue #8071: URL: https://github.com/apache/hudi/issues/8071#issuecomment-1454362050 Sorry for the late reply, did you already try append mode, and is it still slow?
[jira] [Resolved] (HUDI-5812) Optimize the data size check in HoodieBaseParquetWriter
[ https://issues.apache.org/jira/browse/HUDI-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen resolved HUDI-5812. -- > Optimize the data size check in HoodieBaseParquetWriter > --- > > Key: HUDI-5812 > URL: https://issues.apache.org/jira/browse/HUDI-5812 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Hui An >Assignee: Hui An >Priority: Major > Labels: pull-request-available >
[jira] [Closed] (HUDI-5812) Optimize the data size check in HoodieBaseParquetWriter
[ https://issues.apache.org/jira/browse/HUDI-5812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-5812. Fix Version/s: 0.13.1 0.14.0 Resolution: Fixed Fixed via master branch: 2a52bc03d90d88c518d5ab377dc01e717813522b > Optimize the data size check in HoodieBaseParquetWriter > --- > > Key: HUDI-5812 > URL: https://issues.apache.org/jira/browse/HUDI-5812 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Hui An >Assignee: Hui An >Priority: Major > Labels: pull-request-available > Fix For: 0.13.1, 0.14.0 > >
[hudi] branch master updated: [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 2a52bc03d90 [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978) 2a52bc03d90 is described below commit 2a52bc03d90d88c518d5ab377dc01e717813522b Author: Rex(Hui) An AuthorDate: Sat Mar 4 11:36:06 2023 +0800 [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter (#7978) Use a exponentially elastic algorithm to probe the .canWrite flag. --- .../hudi/io/storage/HoodieBaseParquetWriter.java | 38 +-- .../io/storage/TestHoodieBaseParquetWriter.java| 122 + 2 files changed, 150 insertions(+), 10 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java index e38b41d422a..a82c26bae92 100644 --- a/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java +++ b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieBaseParquetWriter.java @@ -21,6 +21,7 @@ package org.apache.hudi.io.storage; import org.apache.hadoop.fs.Path; import org.apache.hudi.common.fs.FSUtils; import org.apache.hudi.common.fs.HoodieWrapperFileSystem; +import org.apache.hudi.common.util.VisibleForTesting; import org.apache.parquet.hadoop.ParquetFileWriter; import org.apache.parquet.hadoop.ParquetWriter; import org.apache.parquet.hadoop.api.WriteSupport; @@ -28,6 +29,9 @@ import org.apache.parquet.hadoop.api.WriteSupport; import java.io.IOException; import java.util.concurrent.atomic.AtomicLong; +import static org.apache.parquet.column.ParquetProperties.DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK; +import static org.apache.parquet.column.ParquetProperties.DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK; + /** * Base class of Hudi's custom {@link ParquetWriter} implementations * @@ -36,11 
+40,9 @@ import java.util.concurrent.atomic.AtomicLong; */ public abstract class HoodieBaseParquetWriter extends ParquetWriter { - private static final int WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK = 1000; - private final AtomicLong writtenRecordCount = new AtomicLong(0); private final long maxFileSize; - private long lastCachedDataSize = -1; + private long recordCountForNextSizeCheck; public HoodieBaseParquetWriter(Path file, HoodieParquetConfig> parquetConfig) throws IOException { @@ -62,17 +64,28 @@ public abstract class HoodieBaseParquetWriter extends ParquetWriter { // stream and the actual file size reported by HDFS this.maxFileSize = parquetConfig.getMaxFileSize() + Math.round(parquetConfig.getMaxFileSize() * parquetConfig.getCompressionRatio()); +this.recordCountForNextSizeCheck = DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK; } public boolean canWrite() { -// TODO we can actually do evaluation more accurately: -// if we cache last data size check, since we account for how many records -// were written we can accurately project avg record size, and therefore -// estimate how many more records we can write before cut off -if (lastCachedDataSize == -1 || getWrittenRecordCount() % WRITTEN_RECORDS_THRESHOLD_FOR_FILE_SIZE_CHECK == 0) { - lastCachedDataSize = getDataSize(); +long writtenCount = getWrittenRecordCount(); +if (writtenCount >= recordCountForNextSizeCheck) { + long dataSize = getDataSize(); + // In some very extreme cases, like all records are same value, then it's possible + // the dataSize is much lower than the writtenRecordCount(high compression ratio), + // causing avgRecordSize to 0, we'll force the avgRecordSize to 1 for such cases. 
+ long avgRecordSize = Math.max(dataSize / writtenCount, 1); + // Follow the parquet block size check logic here, return false + // if it is within ~2 records of the limit + if (dataSize > (maxFileSize - avgRecordSize * 2)) { +return false; + } + recordCountForNextSizeCheck = writtenCount + Math.min( + // Do check it in the halfway + Math.max(DEFAULT_MINIMUM_RECORD_COUNT_FOR_CHECK, (maxFileSize / avgRecordSize - writtenCount) / 2), + DEFAULT_MAXIMUM_RECORD_COUNT_FOR_CHECK); } -return lastCachedDataSize < maxFileSize; +return true; } @Override @@ -84,4 +97,9 @@ public abstract class HoodieBaseParquetWriter extends ParquetWriter { protected long getWrittenRecordCount() { return writtenRecordCount.get(); } + + @VisibleForTesting + protected long getRecordCountForNextSizeCheck() { +return recordCountForNextSizeCheck; + } } diff --git a/hudi-common/src/test/java/org/apache/hudi/io/storage/TestHoodieBaseParquetWriter.java
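The `canWrite()` rewrite in the patch above can be isolated as a standalone sketch. This is a simplified model (the constants and the in-memory byte counter are illustrative, not Hudi's actual writer): instead of probing the expensive data size every fixed N records, the next probe point is projected from the observed average record size, halving the remaining headroom each time, and the writer stops within roughly two records' worth of the size limit:

```java
// Simplified sketch of the adaptive size-check from HoodieBaseParquetWriter
// above; constants and the byte counter are illustrative stand-ins for the
// real parquet data-size probe.
public class AdaptiveSizeCheckSketch {
    static final long MIN_RECORDS_BETWEEN_CHECKS = 100;
    static final long MAX_RECORDS_BETWEEN_CHECKS = 10_000;

    final long maxFileSize;
    long writtenRecords = 0;
    long bytesWritten = 0;
    long nextCheckAt = MIN_RECORDS_BETWEEN_CHECKS;

    AdaptiveSizeCheckSketch(long maxFileSize) {
        this.maxFileSize = maxFileSize;
    }

    boolean canWrite() {
        if (writtenRecords >= nextCheckAt) {
            // Guard against a zero average on extremely compressible data.
            long avg = Math.max(bytesWritten / Math.max(writtenRecords, 1), 1);
            // Mirror parquet's block-size logic: stop within ~2 records of the limit.
            if (bytesWritten > maxFileSize - 2 * avg) {
                return false;
            }
            // Probe again roughly halfway through the projected remaining capacity,
            // clamped between the min and max check intervals.
            long projectedRemaining = (maxFileSize / avg - writtenRecords) / 2;
            nextCheckAt = writtenRecords + Math.min(
                Math.max(MIN_RECORDS_BETWEEN_CHECKS, projectedRemaining),
                MAX_RECORDS_BETWEEN_CHECKS);
        }
        return true;
    }

    void write(long recordSizeBytes) {
        writtenRecords++;
        bytesWritten += recordSizeBytes;
    }

    public static void main(String[] args) {
        // 1 MB limit, ~100-byte records: roughly 10k records fit, with only a
        // handful of size probes instead of one every 1000 records.
        AdaptiveSizeCheckSketch writer = new AdaptiveSizeCheckSketch(1_000_000);
        long records = 0;
        while (writer.canWrite()) {
            writer.write(100);
            records++;
        }
        System.out.println("records before rollover: " + records);
    }
}
```

The design trade-off matches the commit message: early checks happen at the minimum interval while the average record size is still noisy, and the interval widens exponentially as the projection stabilizes, avoiding a `getDataSize()` call per fixed batch.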
[GitHub] [hudi] danny0405 merged pull request #7978: [HUDI-5812] Optimize the data size check in HoodieBaseParquetWriter
danny0405 merged PR #7978: URL: https://github.com/apache/hudi/pull/7978
[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
danny0405 commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454358726 @hudi-bot run azure
[GitHub] [hudi] danny0405 commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server
danny0405 commented on PR #8080: URL: https://github.com/apache/hudi/pull/8080#issuecomment-1454358368 > > @yuzhaojing @xushiyan The changes to the write client are done when introducing the new table service client. Before that, based on my understanding, the inline table services running along with the regular write client share the same timeline server. So I think with the new table service client, we should still follow the same convention. Is there anything I miss? When the table service manager is used, how's the interplay between the timeline server and the table service manager? > > cc @nsivabalan > > Before we fully agree on the approach here, let's not merge this PR. Also, I'd like to add some tests to guard around the expected behavior, after the discussion. > > @yihua @danny0405 @xushiyan I'm sorry for this serious bug. I think the table service client should share the same timeline server as the regular write client. Here I think the following tests can be added to the table service client: > > 1. Add unit tests to confirm that the table service client has not made unexpected modifications to writeConfig. > 2. Confirm that the table service of the table service client is scheduled and executed normally. > 3. Call correctly after starting the managed service. > > Want to hear your thoughts and apologize again! Yeah, we need some basic UT for the service client.
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454346850 ## CI report: * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN * a3062bb83dc4bdbdc39bb3ff4a5c612b2cb5401d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15514) * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15564)
[GitHub] [hudi] hudi-bot commented on pull request #8079: [HUDI-5863] Fix HoodieMetadataFileSystemView serving stale view at the timeline server
hudi-bot commented on PR #8079: URL: https://github.com/apache/hudi/pull/8079#issuecomment-1454344932 ## CI report: * 103f3efa119c4de262544fd1ee412c5375bf55cf UNKNOWN * a3062bb83dc4bdbdc39bb3ff4a5c612b2cb5401d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15514) * 9c8cfc00357c28e105e3fdeb33e3fbee3f384103 UNKNOWN
[GitHub] [hudi] nsivabalan commented on issue #7836: [Q] get history of a given record?
nsivabalan commented on issue #7836: URL: https://github.com/apache/hudi/issues/7836#issuecomment-1454337243 hey @meeting90: if your question is resolved, can you close out the issue? If not, let us know how else we can help.
[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table
nsivabalan commented on issue #7800: URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336739 but as far as trimming down the number of files, we don't have any automatic support as of now, but we will be working on it. If you are interested in working on it, let us know; we can guide you.
[GitHub] [hudi] nsivabalan commented on issue #7800: [SUPPORT] "java.lang.OutOfMemoryError: Requested array size exceeds VM limit" while writing to Hudi COW table
nsivabalan commented on issue #7800: URL: https://github.com/apache/hudi/issues/7800#issuecomment-1454336475 hey @phani482, sorry for the late turnaround. Have you enabled meta sync by any chance? Recently we found an issue where meta sync is loading the archival timeline unnecessarily: https://github.com/apache/hudi/pull/7561 If you can try with 0.13.0 and let us know what you see, that would be nice. Or you can cherry-pick this commit into your internal fork if you have one.
[GitHub] [hudi] nsivabalan commented on issue #7829: [SUPPORT] Using monotonically_increasing_id to generate record key causing duplicates on upsert
nsivabalan commented on issue #7829: URL: https://github.com/apache/hudi/issues/7829#issuecomment-1454335343 I might know why this could be happening; if you can clarify something, we can confirm. For a given df, while generating the primary key using the monotonically increasing function, if we call the key generation twice, it could return different keys, right? Spark only ensures they are unique, not that they are the same across invocations. Down the line, our upsert partitioner routes records based on the hash of the record key. So, if the Spark DAG is re-triggered for one of the Spark partitions, the re-attempt of primary-key generation could produce a new set of keys (whose hash values might differ from the first attempt), and you might see duplicates or data loss.
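The routing argument above can be sketched in a few lines. This is illustrative only (not Hudi's actual upsert partitioner; the class, constant, and key values are assumptions): the record key is the sole routing input, so a key that is regenerated differently on a retried stage can send the "same" logical record to a different file group, yielding a duplicate instead of an update.

```java
// Illustrative sketch (not Hudi's upsert partitioner): routing is a pure
// function of the record key, so a non-deterministic key generator breaks
// the "same record -> same file group" invariant that upserts rely on.
public class KeyRoutingSketch {
    static final int NUM_FILE_GROUPS = 16;

    // Deterministic routing: identical keys always land in the same file group.
    static int fileGroupFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), NUM_FILE_GROUPS);
    }

    public static void main(String[] args) {
        // A stable business key routes identically on every attempt.
        System.out.println(fileGroupFor("order-42") == fileGroupFor("order-42")); // true

        // A regenerated monotonically-increasing id (hypothetical values) for
        // the same logical row can hash to a different file group on a retry,
        // so the row is written fresh there rather than updating the original.
        String firstAttemptKey = "8589934592";
        String retryAttemptKey = "17179869185";
        System.out.println("first attempt -> group " + fileGroupFor(firstAttemptKey));
        System.out.println("retry attempt -> group " + fileGroupFor(retryAttemptKey));
    }
}
```

This is why a stable business key (or a deterministic derivation from the row's own columns) is the safe choice for `hoodie.datasource.write.recordkey.field`, while `monotonically_increasing_id` is not.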
[GitHub] [hudi] nsivabalan commented on issue #7897: [SUPPORT]the compaction of the MOR hudi table keeps the old values
nsivabalan commented on issue #7897: URL: https://github.com/apache/hudi/issues/7897#issuecomment-1454333905 hey @menna224 : let me clarify something and then will ask some clarification. Commit1: Key1, val1 : file1_v1.parquet. Commit2: key2, val2: file1_v2.parquet both file1_v1 and file1_v2 belongs to same file group. When you do read query, hudi will only read file1_v2.parquet. this is due to small file handling. Cleaner when its get executed later, will clean up file1_v1.parquet. but once file1_v2.parquet is created, none of your snapshot queries will read from file1_v1. Commit3: key3, val3.: again due to small file handling, file1_v3.parquet. Commit4: key3, val4 (same key as before, but an update) Hudi will add a log file to file1 (file group). So, on disk its file1_v3.parquet and log_file1.parquet. with rt, hudi will read both of them, merge and server. incase of ro, hudi will read just file1_v3.parquet. Lets say, we keep adding more updates for key3. more log files will be added. once compaction kicks in, a new parquet file will be created file1_v4.parquet (which is a merged version of file1_v3 + all associated log files). Can you clarify whats the issue you are seeing. your example wasn't very clear for me. esply on these statements. ``` then after the 10th update where i changed the name to "joe", I can see 10 log files, and only 1 parquet file, the parquet file that is kept is the last one (file3.parquet) with the old values not the updates ones: (id=3,name=mg) (id=4,name=sa) (id=5,name=john) and file1.parquet were delted. rt table contained the right values (the three records and the last record has a value joe for the coloum name) ro contained the values that's in the parquet ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454306233 ## CI report: * d5333e95b609d585c00404c55151830108dd160c Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560) * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15561)
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454303245 ## CI report: * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559) * d5333e95b609d585c00404c55151830108dd160c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560) * a3473633e6456cf3d6ee4e4dfc34f98250bdff17 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454240600 ## CI report: * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559) * d5333e95b609d585c00404c55151830108dd160c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15560)
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454222596 ## CI report: * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559) * d5333e95b609d585c00404c55151830108dd160c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454180148 ## CI report: * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15559)
[GitHub] [hudi] hudi-bot commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2
hudi-bot commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1454173948 ## CI report: * 61dda6da1e111009d968f3af1735f56b43181be7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15542) * 47cd0ff13e3bce77194c4c85bf3c9ec6ac190c1d UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454173661 ## CI report: * 9ae7b06b3f38d34875349f98d5e64390ab6d60db Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15558)
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454064985 ## CI report: * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1) * 9ae7b06b3f38d34875349f98d5e64390ab6d60db Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15558)
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454056652 ## CI report: * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1) * 9ae7b06b3f38d34875349f98d5e64390ab6d60db UNKNOWN
[jira] [Commented] (HUDI-5840) [DOCS] Add spark procedures do docs
[ https://issues.apache.org/jira/browse/HUDI-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696292#comment-17696292 ] kazdy commented on HUDI-5840: - closing as there's PR open for it already: https://github.com/apache/hudi/pull/8004 > [DOCS] Add spark procedures do docs > --- > > Key: HUDI-5840 > URL: https://issues.apache.org/jira/browse/HUDI-5840 > Project: Apache Hudi > Issue Type: Improvement >Reporter: kazdy >Assignee: kazdy >Priority: Minor > > Add spark procedures do docs, most are missing -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-5840) [DOCS] Add spark procedures do docs
[ https://issues.apache.org/jira/browse/HUDI-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kazdy closed HUDI-5840. --- Resolution: Duplicate > [DOCS] Add spark procedures do docs > --- > > Key: HUDI-5840 > URL: https://issues.apache.org/jira/browse/HUDI-5840 > Project: Apache Hudi > Issue Type: Improvement >Reporter: kazdy >Assignee: kazdy >Priority: Minor > > Add spark procedures do docs, most are missing
[GitHub] [hudi] nsivabalan commented on issue #7906: [SUPPORT] compaction error - Avro field '_hoodie_operation' not found
nsivabalan commented on issue #7906: URL: https://github.com/apache/hudi/issues/7906#issuecomment-1454047912 @danny0405 @bhasudha: do we need an FAQ or troubleshooting guide entry around this?
[GitHub] [hudi] nsivabalan commented on issue #7909: Failed to create Marker file
nsivabalan commented on issue #7909: URL: https://github.com/apache/hudi/issues/7909#issuecomment-1454047432 @koochiswathiTR: any updates on this? If the issue got resolved, can you please close it?
[GitHub] [hudi] nsivabalan commented on issue #7910: [SUPPORT]
nsivabalan commented on issue #7910: URL: https://github.com/apache/hudi/issues/7910#issuecomment-1454046650 You can also check https://medium.com/@simpsons/apache-hudis-small-file-management-17d8c61b20e6 for reference.
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1454046521 ## CI report: * a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557)
[GitHub] [hudi] nsivabalan commented on issue #7910: [SUPPORT]
nsivabalan commented on issue #7910: URL: https://github.com/apache/hudi/issues/7910#issuecomment-1454046187 Is it a COW or MOR table? COW: if you look at S3 directly, you might find older files too. After rewriting a newer version of the base file, Hudi will not delete the older file immediately; the cleaner will take care of it. But your queries/readers will only read the latest version of the data file. With a MOR table, it's more nuanced: by default, only one file group (without any log files) is considered for small file bin packing. If you wish more files to be picked up, you can try tweaking https://hudi.apache.org/docs/configurations/#hoodiemergesmallfilegroupcandidateslimit
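To make the small-file knobs above concrete, here is a minimal sketch as a plain dict of write options. The option names are taken from the Hudi configuration page linked above; the values are illustrative assumptions, not recommendations.

```python
# Illustrative values only; verify option names and defaults against the
# Hudi version you run (https://hudi.apache.org/docs/configurations/).
small_file_opts = {
    # base files smaller than this (bytes) are candidates for bin packing
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # cap on the size of a rewritten base file
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # MOR: how many small file groups (without log files) to consider
    "hoodie.merge.small.file.group.candidates.limit": "10",
}

# Typical use with a hypothetical DataFrame `df`:
# df.write.format("hudi").options(**small_file_opts).mode("append").save(base_path)
```

The small-file limit should stay below the max file size, otherwise every freshly written base file would immediately be a bin-packing candidate again.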
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1454046221 ## CI report: * bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)
[GitHub] [hudi] GallonREX commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
GallonREX commented on issue #7925: URL: https://github.com/apache/hudi/issues/7925#issuecomment-1454041406 This is an automatic reply. Thank you for your email; I have received it and will respond as soon as possible.
[GitHub] [hudi] nsivabalan commented on issue #7925: [SUPPORT]hudi 0.8 upgrade to hudi 0.12 report java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
nsivabalan commented on issue #7925: URL: https://github.com/apache/hudi/issues/7925#issuecomment-1454040912 Generally, multi-writer capability means both writers can write concurrently only if they are not ingesting overlapping data; for example, if each is ingesting to a completely different partition. If not, Hudi may not be able to resolve a winner and hence will abort/fail one of the writers. That is expected. Can you clarify whether the two writers are writing non-overlapping data and still hitting the concurrent modification exception?
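For reference, a hedged sketch of the write options that turn on optimistic concurrency control for multi-writer setups. The option keys come from Hudi's concurrency-control docs; the ZooKeeper URL and lock settings are placeholder assumptions to adapt to your deployment.

```python
# Placeholder values; check the lock-provider options for your environment.
occ_opts = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # lazy cleaning so a failed writer's files are not rolled back eagerly
    # while another writer is still in flight
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host:2181",  # hypothetical host
}
```

Both writers must be launched with the same lock-provider configuration, or conflict resolution cannot work at all.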
[GitHub] [hudi] nsivabalan commented on issue #7960: [SUPPORT]
nsivabalan commented on issue #7960: URL: https://github.com/apache/hudi/issues/7960#issuecomment-1454031910 Yeah, you need to set `--source-ordering-field` as well, which is equivalent to the precombine field if you were to ingest via the Spark datasource writer.
[GitHub] [hudi] nsivabalan commented on issue #7990: [SUPPORT]Is It possible to update hudi table with data that having fewer columns
nsivabalan commented on issue #7990: URL: https://github.com/apache/hudi/issues/7990#issuecomment-145400 thanks!
[GitHub] [hudi] nsivabalan closed issue #7990: [SUPPORT]Is It possible to update hudi table with data that having fewer columns
nsivabalan closed issue #7990: [SUPPORT]Is It possible to update hudi table with data that having fewer columns URL: https://github.com/apache/hudi/issues/7990
[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.
nsivabalan commented on issue #7991: URL: https://github.com/apache/hudi/issues/7991#issuecomment-1454007979 Can you clarify something: what exactly is your Hudi table base path? Is it `data` or `/data/testfolder`? Hudi will not do any list operations on the parent of the Hudi table base path. But if you have other non-Hudi folders within the Hudi table base path, it could try to list those folders, depending on whether you have the metadata table enabled or not. If you can clarify what the base path is, and your findings on which directory gets the high number of LIST calls, we can go from there.
[GitHub] [hudi] nsivabalan commented on issue #7991: Higher number of S3 HEAD requests, while writing data to S3.
nsivabalan commented on issue #7991: URL: https://github.com/apache/hudi/issues/7991#issuecomment-145340 We fixed an issue with Hive sync unnecessarily loading the archived timeline: https://github.com/apache/hudi/pull/7561. With 0.13.0, it should no longer be the case.
[GitHub] [hudi] nsivabalan commented on issue #7996: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieIOException: IOException when reading logblock from log file HoodieLogFile{pathStr='s3://dataho
nsivabalan commented on issue #7996: URL: https://github.com/apache/hudi/issues/7996#issuecomment-1453996107 Actually, we fixed something on this front recently: https://github.com/apache/hudi/pull/7561. Can you try 0.13.0? We expect it to be fixed there. Or you can pull this patch into your internal fork, if you maintain one.
[GitHub] [hudi] nsivabalan commented on issue #8036: [SUPPORT] Migration of hudi tables encountered issue related to metadata column
nsivabalan commented on issue #8036: URL: https://github.com/apache/hudi/issues/8036#issuecomment-1453987238 Good question. Depending on what SQL tool you use, you can explore how to select all columns except a few; then you can ignore the hoodie meta columns explicitly in your insert into statement. For example, in Spark SQL you can do the following:
```
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
// select all columns except a, b
sql("select `(a|b)?+.+` from tmp").show()
// +---+---+
// | id|  c|
// +---+---+
// |  1|  4|
// +---+---+
```
Ref: https://stackoverflow.com/questions/63127263/how-to-select-all-columns-except-2-of-them-from-a-large-table-on-pyspark-sql Hive: https://stackoverflow.com/questions/51227890/hive-how-to-select-all-but-one-column
[jira] [Created] (HUDI-5875) Fix index look/matching records for MERGE INTO with MOR table
sivabalan narayanan created HUDI-5875: - Summary: Fix index look/matching records for MERGE INTO with MOR table Key: HUDI-5875 URL: https://issues.apache.org/jira/browse/HUDI-5875 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: sivabalan narayanan A MERGE INTO statement with a MOR table might go wrong in a corner case where a record is valid as per the base file but has received a delete in the log files. Following this, if a user executes the MERGE INTO statement below {code:java} merge into hudi_table2 using (select * from source) as b on (hudi_table2.id = b.id and hudi_table2.name=b.name) when not matched then insert *; {code} the record that was deleted in the log file might appear to be a valid record per our index lookup. This will not be an issue with a COW table, or after compaction kicks in.
[GitHub] [hudi] nsivabalan commented on issue #8034: [SUPPORT]merge into didn`t reinsert the delete record
nsivabalan commented on issue #8034: URL: https://github.com/apache/hudi/issues/8034#issuecomment-1453971798 Created a ticket to follow up: https://issues.apache.org/jira/browse/HUDI-5875. This will not be an issue with a COW table, or after compaction kicks in for the file group of interest.
[GitHub] [hudi] nsivabalan commented on issue #8034: [SUPPORT]merge into didn`t reinsert the delete record
nsivabalan commented on issue #8034: URL: https://github.com/apache/hudi/issues/8034#issuecomment-1453965150 I can explain what's happening under the hood; I'm not sure yet how we can fix it properly, and it might need deeper thought. After step 8 above, the delete of id=1 goes into a log file in hudi_table2. So if you do a snapshot read from table2, you will not see the id=1 record. But if you do an index lookup, it may appear as though id=1 belongs to hudi_table2 until compaction kicks in. So during step 9, the merge into results in an index lookup (when not matched); both id=1 and id=2 are seen as valid records from hudi_table2, and so it does not re-insert anything.
[GitHub] [hudi] nsivabalan commented on issue #8031: [SUPPORT] Hudi Timestamp Based Key Generator Need Assistance
nsivabalan commented on issue #8031: URL: https://github.com/apache/hudi/issues/8031#issuecomment-1453950617
```
import java.sql.Timestamp
import spark.implicits._

val df = Seq(
  (1, Timestamp.valueOf("2014-01-01 23:00:01"), "abc"),
  (1, Timestamp.valueOf("2014-11-30 12:40:32"), "abc"),
  (2, Timestamp.valueOf("2016-12-29 09:54:00"), "def"),
  (2, Timestamp.valueOf("2016-05-09 10:12:43"), "def")
).toDF("typeId", "eventTime", "str")

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.common.model.HoodieRecord

df.write.format("hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.datasource.write.precombine.field", "typeId").
  option("hoodie.datasource.write.partitionpath.field", "eventTime").
  option("hoodie.datasource.write.recordkey.field", "str").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd hh:mm:ss").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy-MM-dd").
  option("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled", "true").
  option("hoodie.table.name", "hudi_tbl").
  mode(Overwrite).
  save("/tmp/hudi_tbl_trial/")
```
ls of base path:
```
ls -ltr /tmp/hudi_tbl_trial/
total 0
drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2016-12-30
drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2014-01-02
drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2016-05-10
drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:40 2014-12-01
```
If you prefer slash encoded:
```
option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd")
```
but the dir will be 3 levels deep:
```
ls -ltr /tmp/hudi_tbl_trial/
total 0
drwxr-xr-x  4 nsb  wheel  128 Mar  3 10:42 2014
drwxr-xr-x  4 nsb  wheel  128 Mar  3 10:42 2016
nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/
total 0
drwxr-xr-x  3 nsb  wheel  96 Mar  3 10:42 01
drwxr-xr-x  3 nsb  wheel  96 Mar  3 10:42 12
nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/01/
total 0
drwxr-xr-x  6 nsb  wheel  192 Mar  3 10:42 02
nsb$ ls -ltr /tmp/hudi_tbl_trial/2014/01/02/
total 856
-rw-r--r--  1 nsb  wheel  434759 Mar  3 10:42 b02e5e6f-9d28-42d1-b257-3728e534d477-0_3-49-76_20230303104246958.parquet
```
Guess you were missing option("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled","true"). https://hudi.apache.org/docs/configurations/#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled-1
[GitHub] [hudi] nsivabalan closed issue #8021: [DOC] Adding new stored procedures and brief documentation to the Hudi website
nsivabalan closed issue #8021: [DOC] Adding new stored procedures and brief documentation to the Hudi website URL: https://github.com/apache/hudi/issues/8021
[GitHub] [hudi] nsivabalan commented on issue #8021: [DOC] Adding new stored procedures and brief documentation to the Hudi website
nsivabalan commented on issue #8021: URL: https://github.com/apache/hudi/issues/8021#issuecomment-1453926176 Sure @kazdy, that would be really great. Do you think you can add examples when you put one up? That would definitely benefit the community.
[GitHub] [hudi] nsivabalan commented on issue #8025: Found commits after time :20230220161017756, please rollback greater commits first
nsivabalan commented on issue #8025: URL: https://github.com/apache/hudi/issues/8025#issuecomment-1453925203 We also made a fix around rolling back a completed instant: https://github.com/apache/hudi/pull/6313. Can you try 0.12.1, maybe?
[GitHub] [hudi] nsivabalan commented on issue #8025: Found commits after time :20230220161017756, please rollback greater commits first
nsivabalan commented on issue #8025: URL: https://github.com/apache/hudi/issues/8025#issuecomment-1453922536 Can you post the contents of ".hoodie" with last modification times intact (ls -ltr)? Also, when you triggered the rollback via the CLI, what was the entire command you passed? I see we have an option `--rollbackUsingMarkers`; did you set it or not?
[GitHub] [hudi] nsivabalan commented on issue #8016: Inline Clustering : Clustering failed to write to files
nsivabalan commented on issue #8016: URL: https://github.com/apache/hudi/issues/8016#issuecomment-1453917024 Please check out these properties. Max num groups (hoodie.clustering.plan.strategy.max.num.groups): the maximum number of groups to create as part of a ClusteringPlan. Increasing groups increases parallelism. This does not directly determine the number of output file groups; it refers to clustering groups, i.e. the parallel tasks/threads that work towards producing the output file groups. Total output file groups is also determined by the target file size, discussed shortly. Max bytes per group (hoodie.clustering.plan.strategy.max.bytes.per.group): each clustering operation can create multiple output file groups, and the total amount of data processed by one clustering operation is bounded by max bytes per group * max num groups; this config caps the amount of data included in one group. Target file size max (hoodie.clustering.plan.strategy.target.file.max.bytes): each group can produce N (max group size / target file size) output file groups. These might help trim down the amount of data considered for clustering; maybe we are trying to cluster too many files at the same time. Reference: https://medium.com/@simpsons/storage-optimization-with-apache-hudi-clustering-aa6e23e18e77
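The sizing math behind those three properties can be sketched as follows. The numbers here are illustrative assumptions for the calculation, not the actual Hudi defaults.

```python
def clustering_plan_size(max_num_groups, max_bytes_per_group, target_file_max_bytes):
    """Per the properties above: one clustering operation processes at most
    max_num_groups * max_bytes_per_group bytes, and each group can produce
    up to max_bytes_per_group // target_file_max_bytes output files."""
    total_bytes = max_num_groups * max_bytes_per_group
    files_per_group = max_bytes_per_group // target_file_max_bytes
    return total_bytes, files_per_group

GiB = 1024 ** 3
total, per_group = clustering_plan_size(
    max_num_groups=30, max_bytes_per_group=2 * GiB, target_file_max_bytes=1 * GiB)
# with these numbers: 60 GiB processed in total, up to 2 output files per group
```

Lowering any of the three inputs shrinks the amount of work one inline clustering run attempts, which is the usual remedy when clustering fails on too large a batch of files.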
[GitHub] [hudi] nsivabalan commented on issue #8085: [SUPPORT] deltacommit triggering criteria
nsivabalan commented on issue #8085: URL: https://github.com/apache/hudi/issues/8085#issuecomment-1453904397 hey hi @tatiana-rackspace: DeltaStreamer, as you might know, is a streaming ingestion tool. There is a source limit on how much is consumed for each batch: in the case of Kafka, it is a number of messages; in the case of DFS-based sources, it is a number of bytes. You can configure the source limit using `--source-limit`. More info can be found here: https://hudi.apache.org/docs/hoodie_deltastreamer. It also depends on how much data was available when sync() was called. Say you have configured the min sync interval to 30 minutes (`--min-sync-interval-seconds`): DeltaStreamer will try to fetch data from the source and sync to Hudi once every 30 minutes. So at t0 it will consume from the source adhering to the max limit you have configured, and then after 30 minutes it will again consume from the source based on the last checkpoint, again adhering to the source limit. Let me know if this clarifies things.
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453798833

## CI report:

* 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)
* a322500ff1b38637a5efefb58d75ea83bb0dab84 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15557)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453744751

## CI report:

* 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)
* a322500ff1b38637a5efefb58d75ea83bb0dab84 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1453744033

## CI report:

* e5ed02b3c18025fc3b0c5a135be64991fb43417b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15494)
* bbf05d39a470149af7259e2ea0a69b76ebb660df Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] yuzhaojing commented on pull request #8080: [HUDI-5865] Fix table service client to instantiate with timeline server
yuzhaojing commented on PR #8080: URL: https://github.com/apache/hudi/pull/8080#issuecomment-1453735396

> @yuzhaojing @xushiyan The changes to the write client are done when introducing the new table service client. Before that, based on my understanding, the inline table services running along with the regular write client share the same timeline server. So I think with the new table service client, we should still follow the same convention. Is there anything I miss? When the table service manager is used, how's the interplay between the timeline server and the table service manager?
>
> cc @nsivabalan
>
> Before we fully agree on the approach here, let's not merge this PR. Also, I'd like to add some tests to guard around the expected behavior, after the discussion.

@yihua @danny0405 @xushiyan I'm sorry for this serious bug. I agree the table service client should share the same timeline server as the regular write client. I think the following tests can be added for the table service client:

1. Add unit tests to confirm that the table service client does not make unexpected modifications to the writeConfig.
2. Confirm that the table services of the table service client are scheduled and executed normally.
3. Confirm that the calls are made correctly after the managed service is started.

Want to hear your thoughts, and apologies again!
[GitHub] [hudi] hudi-bot commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
hudi-bot commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453733568

## CI report:

* 527bf25f52f47cb92fe6b0ae0e9c1b93da7ab5a9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15548) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15556)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #7951: [HUDI-5796] Adding auto inferring partition from incoming df
hudi-bot commented on PR #7951: URL: https://github.com/apache/hudi/pull/7951#issuecomment-1453732980

## CI report:

* e5ed02b3c18025fc3b0c5a135be64991fb43417b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15494)
* bbf05d39a470149af7259e2ea0a69b76ebb660df UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on pull request #8070: [HUDI-4372] Enable metadata table by default for flink
danny0405 commented on PR #8070: URL: https://github.com/apache/hudi/pull/8070#issuecomment-1453728899 @hudi-bot run azure
[hudi] branch master updated: [HUDI-5847] Add support for multiple metric reporters and metric labels (#8041)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 81e6e854883  [HUDI-5847] Add support for multiple metric reporters and metric labels (#8041)
81e6e854883 is described below

commit 81e6e854883a94d41ae5b7187c608a8ddbc7bf35
Author: Lokesh Jain
AuthorDate: Fri Mar 3 21:06:43 2023 +0530

    [HUDI-5847] Add support for multiple metric reporters and metric labels (#8041)

    Add support for multiple metric reporters within a MetricRegistry. Further it also adds labels to metrics.
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  4 ++
 .../hudi/config/metrics/HoodieMetricsConfig.java   |  6 ++
 .../java/org/apache/hudi/metrics/MetricUtils.java  | 81 ++
 .../main/java/org/apache/hudi/metrics/Metrics.java | 54 ---
 .../hudi/metrics/MetricsReporterFactory.java       | 14 +++-
 .../hudi/metrics/datadog/DatadogHttpClient.java    | 20 --
 .../metrics/datadog/DatadogMetricsReporter.java    |  2 +-
 .../hudi/metrics/datadog/DatadogReporter.java      | 27 +---
 .../prometheus/PushGatewayMetricsReporter.java     | 26 ++-
 .../metrics/prometheus/PushGatewayReporter.java    | 42 ++-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  4 +-
 .../prometheus/TestPushGateWayReporter.java        | 74 +++-
 .../src/test/resources/datadog.properties          | 25 +++
 .../src/test/resources/prometheus.properties       | 24 +++
 14 files changed, 333 insertions(+), 70 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 7ce7d8c6574..886112cae16 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -2123,6 +2123,10 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getStringOrDefault(HoodieMetricsConfig.METRICS_REPORTER_PREFIX);
   }
 
+  public String getMetricReporterFileBasedConfigs() {
+    return getStringOrDefault(HoodieMetricsConfig.METRICS_REPORTER_FILE_BASED_CONFIGS_PATH);
+  }
+
   /**
    * memory configs.
    */
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
index 486f1277ba7..b7f3fa1f630 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java
@@ -95,6 +95,12 @@ public class HoodieMetricsConfig extends HoodieConfig {
     .sinceVersion("0.13.0")
     .withDocumentation("Enable metrics for locking infra. Useful when operating in multiwriter mode");
 
+  public static final ConfigProperty METRICS_REPORTER_FILE_BASED_CONFIGS_PATH = ConfigProperty
+      .key(METRIC_PREFIX + ".configs.properties")
+      .defaultValue("")
+      .sinceVersion("0.14.0")
+      .withDocumentation("Comma separated list of config file paths for metric exporter configs");
+
   /**
    * @deprecated Use {@link #TURN_METRICS_ON} and its methods instead
    */
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java
new file mode 100644
index 000..e119760883f
--- /dev/null
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricUtils.java
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import java.util.Arrays;
+import java.util.List;
+import
[GitHub] [hudi] codope merged pull request #8041: [HUDI-5847] Add support for multiple metric reporters and metric labels
codope merged PR #8041: URL: https://github.com/apache/hudi/pull/8041
[GitHub] [hudi] nsivabalan commented on pull request #8041: [HUDI-5847] Add support for multiple metric reporters and metric labels
nsivabalan commented on PR #8041: URL: https://github.com/apache/hudi/pull/8041#issuecomment-1453706032 CI is green (screenshot): https://user-images.githubusercontent.com/513218/222760967-36a1b0e9-fb75-46cd-8e15-7ed373cc5b32.png
[hudi] branch master updated: [HUDI-5665] Adding support to re-use table configs (#7901)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new cfe490fcb23  [HUDI-5665] Adding support to re-use table configs (#7901)
cfe490fcb23 is described below

commit cfe490fcb2333049b4f47a2d1d241b07e12d42c1
Author: Sivabalan Narayanan
AuthorDate: Fri Mar 3 07:09:03 2023 -0800

    [HUDI-5665] Adding support to re-use table configs (#7901)

    - As of now, we expect users to set some of the mandatory fields in every write. For eg, record keys, partition path etc. These cannot change for a given table and gets serialized into table config. In this patch, we are adding support to re-use table configs. So, users can set these configs only in the first commit for a given table. Subsequent writes might re-use from table config if not explicitly set by the user.
---
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  29 +++-
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     |  44 --
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |  22 ++-
 .../org/apache/hudi/TestHoodieSparkSqlWriter.scala |  34 ++---
 .../apache/hudi/functional/TestCOWDataSource.scala | 164 +
 .../hudi/functional/TestStreamingSource.scala      |   4 +
 6 files changed, 260 insertions(+), 37 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index d2c8629df98..1e3c219b6c6 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -25,7 +25,7 @@ import org.apache.hudi.common.model.{HoodieTableType, WriteOperationType}
 import org.apache.hudi.common.table.HoodieTableConfig
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.common.util.{Option, StringUtils}
-import org.apache.hudi.config.{HoodieClusteringConfig, HoodieWriteConfig}
+import org.apache.hudi.config.{HoodieClusteringConfig, HoodiePayloadConfig, HoodieWriteConfig}
 import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncConfigHolder, HiveSyncTool}
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions
 import org.apache.hudi.keygen.{ComplexKeyGenerator, CustomKeyGenerator, NonpartitionedKeyGenerator, SimpleKeyGenerator}
@@ -830,6 +830,33 @@ object DataSourceOptionsHelper {
     translatedOpt.toMap
   }
 
+  /**
+   * Some config keys differ from what user sets and whats part of table Config. this method assists in fetching the
+   * right table config and populating write configs.
+   * @param tableConfig table config of interest.
+   * @param params incoming write params.
+   * @return missing params that needs to be added to incoming write params
+   */
+  def fetchMissingWriteConfigsFromTableConfig(tableConfig: HoodieTableConfig, params: Map[String, String]) : Map[String, String] = {
+    val missingWriteConfigs = scala.collection.mutable.Map[String, String]()
+    if (!params.contains(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) && tableConfig.getRecordKeyFieldProp != null) {
+      missingWriteConfigs ++= Map(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key() -> tableConfig.getRecordKeyFieldProp)
+    }
+    if (!params.contains(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key()) && tableConfig.getPartitionFieldProp != null) {
+      missingWriteConfigs ++= Map(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key() -> tableConfig.getPartitionFieldProp)
+    }
+    if (!params.contains(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key()) && tableConfig.getKeyGeneratorClassName != null) {
+      missingWriteConfigs ++= Map(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> tableConfig.getKeyGeneratorClassName)
+    }
+    if (!params.contains(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key()) && tableConfig.getPreCombineField != null) {
+      missingWriteConfigs ++= Map(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key -> tableConfig.getPreCombineField)
+    }
+    if (!params.contains(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key()) && tableConfig.getPayloadClass != null) {
+      missingWriteConfigs ++= Map(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key() -> tableConfig.getPayloadClass)
+    }
+    missingWriteConfigs.toMap
+  }
+
   def parametersWithReadDefaults(parameters: Map[String, String]): Map[String, String] = {
     // First check if the ConfigUtils.IS_QUERY_AS_RO_TABLE has set by HiveSyncTool,
     // or else use query type from QUERY_TYPE.
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index