[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240256644

## CI report:

* a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11233)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #6525: [HUDI-4237] should not sync partition parameters when create non-partition table in spark
alexeykudinkin commented on PR #6525: URL: https://github.com/apache/hudi/pull/6525#issuecomment-1240241085

Approved already. @nsivabalan can you please help landing this one?
[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1240240348

@boneanxs will do
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1240221361

## CI report:

* fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
* 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956)
* 18fd090f5b6ea14f970a315788372df5acac7939 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11135)
* ccb8f60b4f1280ce8935d5713d03d6d9e0eac8fb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11241)
[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath
hudi-bot commented on PR #6631: URL: https://github.com/apache/hudi/pull/6631#issuecomment-1240221583

## CI report:

* e8e8c4d8047b5985764f7534bd84e82763c3ad28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11243)
[GitHub] [hudi] hudi-bot commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
hudi-bot commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240220669

## CI report:

* 7a9f87cb94043c2447da84ff07ff93009c891174 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11214)
* 07f8c3922c20d3350a21ead05f0104ba57af0092 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11242)
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4810:
---------------------------------
    Labels: pull-request-available  (was: )

> Fix Hudi bundles requiring log4j2 on the classpath
> --------------------------------------------------
>
>                 Key: HUDI-4810
>                 URL: https://issues.apache.org/jira/browse/HUDI-4810
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> As part of addressing HUDI-4441, we erroneously rebased Hudi onto the "log4j-1.2-api" module under the impression that it was an API module (as advertised). That turned out not to be the case: it is an actual bridge implementation, requiring a Log4j2 implementation to be provided on the classpath as a required dependency.
> For versions of Spark < 3.3, this triggers exceptions like the following (reported by [~akmodi]):
>
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>     at org.apache.hudi.metrics.datadog.DatadogReporter.<init>(DatadogReporter.java:55)
>     at org.apache.hudi.metrics.datadog.DatadogMetricsReporter.<init>(DatadogMetricsReporter.java:62)
>     at org.apache.hudi.metrics.MetricsReporterFactory.createReporter(MetricsReporterFactory.java:70)
>     at org.apache.hudi.metrics.Metrics.<init>(Metrics.java:50)
>     at org.apache.hudi.metrics.Metrics.init(Metrics.java:96)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics.<init>(HoodieDeltaStreamerMetrics.java:44)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:243)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:663)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:143)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:116)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:562)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 23 more
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
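The failure mode above can be reproduced in miniature. The following is a hypothetical sketch (not Hudi code): a classloader that cannot see any third-party jars fails to resolve `org.apache.logging.log4j.LogManager`, which is exactly what happens at runtime when the `log4j-1.2-api` bridge is bundled without a Log4j2 implementation on the classpath.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class Log4jClasspathCheck {
    public static void main(String[] args) throws Exception {
        // An isolated loader with no URLs and a null parent delegates only to the
        // bootstrap loader: java.* classes resolve, third-party jars do not.
        try (URLClassLoader isolated = new URLClassLoader(new URL[0], null)) {
            try {
                Class.forName("org.apache.logging.log4j.LogManager", false, isolated);
                System.out.println("LogManager resolved");
            } catch (ClassNotFoundException e) {
                // Mirrors the "Caused by" clause in the stack trace above.
                System.out.println("ClassNotFoundException: " + e.getMessage());
            }
        }
    }
}
```

In the real bundles the same lookup is triggered indirectly, through logging calls made via the bridge, rather than by an explicit `Class.forName`.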
[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath
hudi-bot commented on PR #6631: URL: https://github.com/apache/hudi/pull/6631#issuecomment-1240218501

## CI report:

* e8e8c4d8047b5985764f7534bd84e82763c3ad28 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1240218305

## CI report:

* fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
* 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956)
* 18fd090f5b6ea14f970a315788372df5acac7939 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11135)
* ccb8f60b4f1280ce8935d5713d03d6d9e0eac8fb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
hudi-bot commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240217636

## CI report:

* 7a9f87cb94043c2447da84ff07ff93009c891174 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11214)
* 07f8c3922c20d3350a21ead05f0104ba57af0092 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6616: Add Postgres Schema Name to Postgres Debezium Source
hudi-bot commented on PR #6616: URL: https://github.com/apache/hudi/pull/6616#issuecomment-1240215471

## CI report:

* 25a5a5c619d56e686e6fb38e20e841ef9a1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11231)
[GitHub] [hudi] hudi-bot commented on pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
hudi-bot commented on PR #6628: URL: https://github.com/apache/hudi/pull/6628#issuecomment-1240215510

## CI report:

* 2504fd6b17a7a3fb2a77f755d7fe6b6c7f83c96f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11232)
[GitHub] [hudi] praveenkmr commented on issue #6623: [SUPPORT] java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener with HBase Index
praveenkmr commented on issue #6623: URL: https://github.com/apache/hudi/issues/6623#issuecomment-1240213423

@yihua Thanks a lot, Ethan. I tried the suggestion and it worked fine. Still wondering: for future upgrades, do we need to follow the same approach of loading all the jars during spark-submit, or is there scope in the latest version to use hudi-spark-bundle.jar directly?
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965494212

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java:

@@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig {
       .sinceVersion("0.7.0")
       .withDocumentation("");

+  public static final ConfigProperty<Boolean> LOCK_METRICS_ENABLE = ConfigProperty
+      .key(METRIC_PREFIX + ".lock.enable")
+      .defaultValue(false)

Review Comment: Fixed
[GitHub] [hudi] LXin96 commented on a diff in pull request #6614: [DOCS] Asf site update flink option 'read.tasks & write.tasks' description
LXin96 commented on code in PR #6614: URL: https://github.com/apache/hudi/pull/6614#discussion_r965490917

## website/docs/configurations.md:

@@ -978,8 +978,8 @@ Actual value obtained by invoking .toString(), default ''
 ---
 > write.tasks
-> Parallelism of tasks that do actual write, default is 4
-> **Default Value**: 4 (Optional)
+> Parallelism of tasks that do actual write, default is the parallelism of the execution environment
+> **Default Value**: N/A (Optional)

Review Comment: OK, I get that.
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965487305

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java:

@@ -130,6 +140,13 @@ public Timer.Context getIndexCtx() {
     return indexTimer == null ? null : indexTimer.time();
   }

+  public Timer.Context getConflictResolutionCtx() {
+    if (config.isMetricsOn() && conflictResolutionTimer == null) {

Review Comment: Going forward with the infer change, I am only checking for LockMetricsOn.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240196482

Either clustering causes the data duplication, or it is a Presto engine adapter issue.
[jira] [Created] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
Alexey Kudinkin created HUDI-4810:
-------------------------------------
             Summary: Fix Hudi bundles requiring log4j2 on the classpath
                 Key: HUDI-4810
                 URL: https://issues.apache.org/jira/browse/HUDI-4810
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.1
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4810:
----------------------------------
    Status: In Progress  (was: Open)
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4810:
----------------------------------
    Sprint: 2022/09/05
[GitHub] [hudi] alexeykudinkin opened a new pull request, #6631: [WIP] Fixing Hudi bundles requiring log4j2 on the classpath
alexeykudinkin opened a new pull request, #6631: URL: https://github.com/apache/hudi/pull/6631

### Change Logs

In XXX, we rebased Hudi to instead rely mostly on the Log4j2 bridge and implementations (in tests). However, we missed the fact that `log4j-1.2-api` is not actually an API module (as advertised) but rather a fully-fledged implementation, bringing in the requirement to provide a Log4j2 implementation jar on the classpath.

### Impact

Risk level: Medium

TBD: Manual bundles compatibility verification.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965477406

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java:

@@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig {
       .sinceVersion("0.7.0")
       .withDocumentation("");

+  public static final ConfigProperty<Boolean> LOCK_METRICS_ENABLE = ConfigProperty
+      .key(METRIC_PREFIX + ".lock.enable")
+      .defaultValue(false)

Review Comment: OK, so we want the default to correspond to metrics.on, and false only when set explicitly. Is my understanding correct?
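The inference being discussed can be sketched as follows. This is a hypothetical helper, not Hudi's actual `ConfigProperty` API; the key names follow the PR, but the lookup logic is an assumption about the intended behavior: an explicit value for the lock-metrics key wins, otherwise the value is inherited from the global `hoodie.metrics.on` switch.

```java
import java.util.Map;

public class LockMetricsDefault {
    // Hypothetical sketch of "infer default from another key".
    static boolean lockMetricsEnabled(Map<String, String> conf) {
        String explicit = conf.get("hoodie.metrics.lock.enable");
        if (explicit != null) {
            // An explicit setting always wins, even when metrics are on globally.
            return Boolean.parseBoolean(explicit);
        }
        // Otherwise inherit from the global metrics switch.
        return Boolean.parseBoolean(conf.getOrDefault("hoodie.metrics.on", "false"));
    }

    public static void main(String[] args) {
        // Inherited from hoodie.metrics.on -> true
        System.out.println(lockMetricsEnabled(Map.of("hoodie.metrics.on", "true")));
        // Explicit false overrides the global switch -> false
        System.out.println(lockMetricsEnabled(Map.of(
                "hoodie.metrics.on", "true",
                "hoodie.metrics.lock.enable", "false")));
    }
}
```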
[GitHub] [hudi] hudi-bot commented on pull request #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer
hudi-bot commented on PR #5269: URL: https://github.com/apache/hudi/pull/5269#issuecomment-1240185610

## CI report:

* 6f8d22ccc5efbd87ff993a46ea1977355842602f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7944)
* a360d286f9a9bff3f60cc7231bc0abfe86675a88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11240)
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965475615

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:

@@ -64,13 +69,18 @@ public void lock() {
     boolean acquired = false;
     while (retryCount <= maxRetries) {
       try {
+        metrics.startLockApiTimerContext();
         acquired = lockProvider.tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS);
         if (acquired) {
+          metrics.updateLockAcquiredMetric();
+          metrics.startLockHeldTimerContext();

Review Comment: `updateLockAcquiredMetric` has a parallel function, `updateLockNotAcquiredMetric`, which is the reason I have kept them separate.
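The retry loop in the diff above can be sketched in isolation. This is a hedged illustration, not Hudi's `LockManager` or `HoodieLockMetrics` API: a plain `ReentrantLock` stands in for the lock provider, and the metric calls from the diff are marked as comments where they would go.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    // Hypothetical sketch of a timed lock-acquisition retry loop.
    public static boolean lockWithRetries(ReentrantLock lock, int maxRetries,
                                          long waitMs) throws InterruptedException {
        for (int retry = 0; retry <= maxRetries; retry++) {
            long start = System.nanoTime();                        // startLockApiTimerContext()
            boolean acquired = lock.tryLock(waitMs, TimeUnit.MILLISECONDS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (acquired) {
                // updateLockAcquiredMetric() + startLockHeldTimerContext() would go here.
                System.out.println("acquired after " + elapsedMs + " ms on retry " + retry);
                return true;
            }
            // updateLockNotAcquiredMetric() would be recorded on this failure path.
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        System.out.println(lockWithRetries(lock, 2, 10));
    }
}
```

Keeping the acquired/not-acquired updates as separate methods, as the comment argues, lets each path carry its own counter and timer without branching inside a single metric call.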
[GitHub] [hudi] hudi-bot commented on pull request #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer
hudi-bot commented on PR #5269: URL: https://github.com/apache/hudi/pull/5269#issuecomment-1240183216

## CI report:

* 6f8d22ccc5efbd87ff993a46ea1977355842602f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7944)
* a360d286f9a9bff3f60cc7231bc0abfe86675a88 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
hudi-bot commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240181354

## CI report:

* 85a8f5166c17ec5ce9fa00e2c38846f440582acf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11239)
[GitHub] [hudi] hudi-bot commented on pull request #6629: [HUDI-4807] Use base table instant for metadata table initialization
hudi-bot commented on PR #6629: URL: https://github.com/apache/hudi/pull/6629#issuecomment-1240181342 ## CI report: * c88a869d5d8e748edac75698c7c504176a06e47d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11238)
[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573
hudi-bot commented on PR #6574: URL: https://github.com/apache/hudi/pull/6574#issuecomment-1240181238 ## CI report: * 7ced8cc1e89594e2a074a546a165ce3ef744841f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11211) * b158b5a580ffb609380dcac27a299c9a7557d649 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11237)
[GitHub] [hudi] hudi-bot commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
hudi-bot commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240178449 ## CI report: * 85a8f5166c17ec5ce9fa00e2c38846f440582acf UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573
hudi-bot commented on PR #6574: URL: https://github.com/apache/hudi/pull/6574#issuecomment-1240178259 ## CI report: * 7ced8cc1e89594e2a074a546a165ce3ef744841f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11211) * b158b5a580ffb609380dcac27a299c9a7557d649 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6629: [HUDI-4807] Use base table instant for metadata table initialization
hudi-bot commented on PR #6629: URL: https://github.com/apache/hudi/pull/6629#issuecomment-1240178412 ## CI report: * c88a869d5d8e748edac75698c7c504176a06e47d UNKNOWN
[GitHub] [hudi] Gatsby-Lee closed issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure
Gatsby-Lee closed issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure URL: https://github.com/apache/hudi/issues/6024
[GitHub] [hudi] Gatsby-Lee commented on issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure
Gatsby-Lee commented on issue #6024: URL: https://github.com/apache/hudi/issues/6024#issuecomment-1240177021 Hi, let's close this issue since I seem to be the only one facing it. Let me write down more details before I forget. A couple of months ago, I tried the DELETE_PARTITION operation with 0.10.1 and 0.11.0, and noticed that the two versions behave differently when Hudi runs DELETE_PARTITION on a non-existent partition:

* 0.10.1 raised an exception and failed (the serious issue was that Hudi became unstable).
* 0.11.0 was silent (VC told me that this is not the right behavior either; it should raise an exception).

I wasn't able to use 0.11.0 because it has a compatibility issue in AWS Glue (related to the AWS Glue Catalog). I wasn't able to use 0.10.1 because it has a bug in ZookeeperLockProvider. I ended up using 0.10.1 plus a patch that fixed the ZookeeperLockProvider (available in 0.11.1), and I added logic that checks whether the target partition exists. ( cc @codope ) I will test with 0.11.1 and reopen this ticket if I still notice a similar issue. Thank you, Gatsby
[GitHub] [hudi] yihua commented on issue #6590: [SUPPORT] HoodieDeltaStreamer AWSDmsAvroPayload fails to handle deletes in MySQL
yihua commented on issue #6590: URL: https://github.com/apache/hudi/issues/6590#issuecomment-1240175913 This is the same issue as #6552. cc @rahil-c
[GitHub] [hudi] yihua commented on issue #6552: [SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0
yihua commented on issue #6552: URL: https://github.com/apache/hudi/issues/6552#issuecomment-1240175667 @rahil-c and I discussed this today. The proper fix is to call the corresponding API instead of repeating the invocation of `handleDeleteOperation`:

```java
@Override
public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
  return getInsertValue(schema);
}

@Override
public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
  IndexedRecord insertValue = super.getInsertValue(schema).get();
  return handleDeleteOperation(insertValue);
}

@Override
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties) throws IOException {
  return combineAndGetUpdateValue(currentValue, schema);
}

@Override
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
  IndexedRecord insertValue = super.getInsertValue(schema).get();
  return handleDeleteOperation(insertValue);
}
```

@rahil-c will put up a fix.
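The essence of the fix above is an overload-delegation pattern: the `Properties`-taking overload forwards to the simpler overload, so the delete-handling logic lives in exactly one place. A minimal, self-contained sketch of that pattern (the `PayloadSketch` class and its fields are hypothetical stand-ins, not the real `AWSDmsAvroPayload`):

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical payload: the Properties overload delegates instead of
// re-implementing the delete check, so there is a single source of truth.
public class PayloadSketch {
  private final String value;
  private final boolean isDelete;  // stand-in for a DMS-style delete marker

  public PayloadSketch(String value, boolean isDelete) {
    this.value = value;
    this.isDelete = isDelete;
  }

  // The only place where delete handling happens.
  public Optional<String> getInsertValue() {
    return isDelete ? Optional.empty() : Optional.of(value);
  }

  // Overload with extra context: just delegate, never duplicate the logic.
  public Optional<String> getInsertValue(Properties props) {
    return getInsertValue();
  }
}
```

If the delete check were copied into both overloads, a later change to one of them could silently diverge from the other, which is exactly the class of bug the comment describes.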
[GitHub] [hudi] xiarixiaoyao commented on pull request #6322: [HUDI-4559] Support hiveSync command based on Call Produce Command
xiarixiaoyao commented on PR #6322: URL: https://github.com/apache/hudi/pull/6322#issuecomment-1240173783 @XuQianJin-Stars pls resolve the conflicts, thanks
[GitHub] [hudi] wangp-nhlab commented on pull request #6544: When Hudi choose Append save mode in Spark , the basepath may be error codes
wangp-nhlab commented on PR #6544: URL: https://github.com/apache/hudi/pull/6544#issuecomment-1240166688 > @wangp-nhlab Could you follow the process [here](https://hudi.apache.org/contribute/developer-setup#filing-jiras) to create a JIRA ticket and attach the ticket number to the PR? Okay
[GitHub] [hudi] TJX2014 commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240166454 Hi, @danny0405 , this is another patch for https://github.com/apache/hudi/pull/6595
[GitHub] [hudi] wangp-nhlab commented on a diff in pull request #6544: When Hudi choose Append save mode in Spark , the basepath may be error codes
wangp-nhlab commented on code in PR #6544: URL: https://github.com/apache/hudi/pull/6544#discussion_r965463084 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java: ##

```diff
@@ -176,7 +178,8 @@ private T executeRequest(String requestPath, Map queryParame
         response = Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute();
         break;
     }
-    String content = response.returnContent().asString();
+    Charset charset = Consts.UTF_8;
+    String content = response.returnContent().asString(charset);
```

Review Comment: Yes, it is found that only the append mode in Spark has this problem.
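The one-line change in the diff matters because `asString()` with no argument decodes the response body with a default charset, which can garble non-ASCII base paths. A self-contained demonstration of the failure mode using only the JDK (no Hudi or HttpClient types; this sketch only models the decoding step, not the HTTP call):

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
  // Decode UTF-8 encoded bytes with the correct, explicit charset.
  public static String decodeUtf8(byte[] body) {
    return new String(body, StandardCharsets.UTF_8);
  }

  // Decode the same bytes with a wrong charset, simulating an unlucky
  // platform default; multi-byte characters come out garbled.
  public static String decodeLatin1(byte[] body) {
    return new String(body, StandardCharsets.ISO_8859_1);
  }
}
```

Passing the charset explicitly, as the patch does with `asString(charset)`, removes the dependency on the JVM's platform default entirely.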
[GitHub] [hudi] TJX2014 commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240165726 @minihippo Hi, please help review this; I think this patch can fix HoodieSimpleBucketIndex first.
[jira] [Updated] (HUDI-4808) HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table
[ https://issues.apache.org/jira/browse/HUDI-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4808: - Labels: pull-request-available (was: ) > HoodieSimpleBucketIndex should also consider bucket num in log file not in > base file which written by flink mor table > - > > Key: HUDI-4808 > URL: https://issues.apache.org/jira/browse/HUDI-4808 > Project: Apache Hudi > Issue Type: Bug >Reporter: JinxinTang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #6016: [HUDI-4465] Optimizing file-listing sequence of Metadata Table
codope commented on code in PR #6016: URL: https://github.com/apache/hudi/pull/6016#discussion_r965462152 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java: ##

```diff
@@ -46,6 +47,12 @@ public SimpleKeyGenerator(TypedProperties props) {
   SimpleKeyGenerator(TypedProperties props, String recordKeyField, String partitionPathField) {
     super(props);
+    // Make sure key-generator is configured properly
+    ValidationUtils.checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
+        "Record key field has to be non-empty!");
+    ValidationUtils.checkArgument(partitionPathField == null || !partitionPathField.isEmpty(),
```

Review Comment: Fair enough. Should we add these validations to other keygens as well?
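The validations in the diff use the pattern `field == null || !field.isEmpty()`: a `null` field (not configured at all) is allowed, while an explicitly empty string is rejected as a misconfiguration. A small stand-alone sketch of that contract (a hypothetical helper, not the actual `ValidationUtils`):

```java
public class KeyFieldValidation {
  // Accept null (unset) but reject the empty string, which almost
  // certainly indicates a misconfigured key generator.
  public static void checkNullOrNonEmpty(String field, String message) {
    if (field != null && field.isEmpty()) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Failing fast in the constructor, as the patch does, surfaces the misconfiguration at table-setup time instead of as a confusing downstream key-generation error.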
[GitHub] [hudi] TJX2014 opened a new pull request, #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 opened a new pull request, #6630: URL: https://github.com/apache/hudi/pull/6630

### Change Logs

Make HoodieSimpleBucketIndex also load the bucket index from log files.

### Impact

Spark will read the bucket index correctly when the log file was written by Flink to a MOR table.

**Risk level:** none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
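For context, a simple bucket index deterministically maps each record key to one of N buckets; the bug is about which file (base file vs log file) the existing bucket membership is discovered from when Flink has written a log-only MOR file group. A minimal hashing sketch of the key-to-bucket mapping (hypothetical, not Hudi's actual `BucketIdentifier`):

```java
public class BucketIndexSketch {
  // Deterministically map a record key to a bucket in [0, numBuckets).
  // Masking with Integer.MAX_VALUE avoids a negative bucket id when
  // hashCode() is negative.
  public static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }
}
```

Because the mapping is deterministic, every writer (Spark or Flink) must agree on it and on where bucket membership is recorded; otherwise one engine routes updates to a bucket the other engine cannot find.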
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965462117 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ##

```diff
@@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception {
     }
   }
 }
+
+  /**
+   * Determine whether to throw an exception when local view of table's timeline is behind that of client's view.
+   */
+  private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) {
```

Review Comment: For example, when there is only one more .clean.completed, it should also be synchronized
[jira] [Created] (HUDI-4809) Hudi Support AWS Glue DropPartitions
XixiHua created HUDI-4809: - Summary: Hudi Support AWS Glue DropPartitions Key: HUDI-4809 URL: https://issues.apache.org/jira/browse/HUDI-4809 Project: Apache Hudi Issue Type: New Feature Components: metadata Reporter: XixiHua
[jira] [Created] (HUDI-4808) HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table
JinxinTang created HUDI-4808: Summary: HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table Key: HUDI-4808 URL: https://issues.apache.org/jira/browse/HUDI-4808 Project: Apache Hudi Issue Type: Bug Reporter: JinxinTang
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965460402 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ##

```diff
@@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception {
     }
   }
 }
+
+  /**
+   * Determine whether to throw an exception when local view of table's timeline is behind that of client's view.
+   */
+  private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) {
```

Review Comment: My idea is to judge whether to throw an exception if `isLocalViewBehind` returns true. Because the `isLocalViewBehind` method is also used in the `syncIfLocalViewBehind` method, I am not sure whether it is appropriate to directly modify the logic of the `isLocalViewBehind` method
[GitHub] [hudi] TJX2014 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…
TJX2014 commented on PR #6595: URL: https://github.com/apache/hudi/pull/6595#issuecomment-1240160173 > I will provide a PR fix on the Spark side too, but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to the log file, it is very hard for compaction to read them, and it also makes the MOR table unstable because the duplicate records are read into memory twice.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240156176

> Remove these config, then data duplication disappeared. why?
>
> ```
> // option("hoodie.clustering.inline", "true").
> // option("hoodie.clustering.inline.max.commits", "4").
> // option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
> // option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
> // option("hoodie.clustering.plan.strategy.sort.columns", "userId,schoolId,timeStamp").
> ```
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240156022 > still repeated according to the new patch
[GitHub] [hudi] arunb2w commented on issue #6626: [SUPPORT] HUDI merge into via spark sql not working
arunb2w commented on issue #6626: URL: https://github.com/apache/hudi/issues/6626#issuecomment-1240154803 @nsivabalan Can you please provide some help on this issue
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240154151 After removing these configs, the data duplication disappeared. Why?

```
// option("hoodie.clustering.inline", "true").
// option("hoodie.clustering.inline.max.commits", "4").
// option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
// option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
// option("hoodie.clustering.plan.strategy.sort.columns", "userId,schoolId,timeStamp").
```
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240151995 still repeated according to the new patch
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4722: -- Status: Patch Available (was: In Progress) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jagmeet bali >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Added metrics for following > # Lock request latency > # Count of Lock success > # Count of failed to acquire the lock > # Duration of locks held with support for re-entrancy > # Conflict resolution metrics. Success vs Failure
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4722: -- Status: In Progress (was: Open) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jagmeet bali >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Added metrics for following > # Lock request latency > # Count of Lock success > # Count of failed to acquire the lock > # Duration of locks held with support for re-entrancy > # Conflict resolution metrics. Success vs Failure
[jira] [Updated] (HUDI-4807) Use correct instant in metadata initialization
[ https://issues.apache.org/jira/browse/HUDI-4807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4807: - Labels: pull-request-available (was: ) > Use correct instant in metadata initialization > -- > > Key: HUDI-4807 > URL: https://issues.apache.org/jira/browse/HUDI-4807 > Project: Apache Hudi > Issue Type: Bug >Reporter: Yuwei Xiao >Priority: Major > Labels: pull-request-available >
[GitHub] [hudi] YuweiXiao opened a new pull request, #6629: [HUDI-4807] Use base table instant for metadata table initialization
YuweiXiao opened a new pull request, #6629: URL: https://github.com/apache/hudi/pull/6629

### Change Logs

Use base table instant for metadata table initialization

### Impact

No public API change.

**Risk level:** none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Closed] (HUDI-4615) Fix empty commits being made by deltastreamer with S3EventsSource when there is no data in SQS on starting a new pipeline
[ https://issues.apache.org/jira/browse/HUDI-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4615. - Resolution: Fixed > Fix empty commits being made by deltastreamer with S3EventsSource when there > is no data in SQS on starting a new pipeline > - > > Key: HUDI-4615 > URL: https://issues.apache.org/jira/browse/HUDI-4615 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: sivabalan narayanan >Assignee: Vinish Reddy >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > When we start a new deltastreamer with S3EventsSource, checkpoint is > Option.empty(). After consumption from source, if there is no data, the > source returns "val=0" as the checkpoint. So, deltastreamer assumes > checkpoint has changed and makes an empty commit. This needs fixing. > > [https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java#L151] >
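The bug described in this ticket is a checkpoint-comparison subtlety: a brand-new pipeline starts with an empty checkpoint, and a source that returns a synthetic "0" checkpoint on an empty poll makes the comparison look like progress. A hypothetical sketch of the faulty vs fixed commit decision (stand-in names, not the real DeltaStreamer code):

```java
import java.util.Optional;

public class CheckpointSketch {
  // Faulty rule: commit whenever the checkpoint string changed.
  // Optional.empty() vs "0" looks like a change, so an empty batch commits.
  public static boolean shouldCommitFaulty(Optional<String> previous, String latest, long numRecords) {
    return !previous.map(latest::equals).orElse(false);
  }

  // Fixed rule: additionally require that the batch actually contained data.
  public static boolean shouldCommitFixed(Optional<String> previous, String latest, long numRecords) {
    return numRecords > 0 && !previous.map(latest::equals).orElse(false);
  }
}
```

This is only one way to close the gap; the actual Hudi fix may gate the commit differently, but the failing comparison it addresses is the one shown here.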
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve
nsivabalan commented on code in PR #5030: URL: https://github.com/apache/hudi/pull/5030#discussion_r965448515 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ##

```diff
@@ -123,25 +133,24 @@ public long getNumMergedRecordsInLog() {
     return numMergedRecordsInLog;
   }
-  /**
-   * Returns the builder for {@code HoodieMergedLogRecordScanner}.
-   */
-  public static HoodieMergedLogRecordScanner.Builder newBuilder() {
-    return new Builder();
-  }
   @Override
   protected void processNextRecord(HoodieRecord hoodieRecord) throws IOException {
     String key = hoodieRecord.getRecordKey();
     if (records.containsKey(key)) {
       // Merge and store the merged record. The HoodieRecordPayload implementation is free to decide what should be
       // done when a delete (empty payload) is encountered before or after an insert/update.
-      HoodieRecord oldRecord = records.get(key);
-      HoodieRecordPayload oldValue = oldRecord.getData();
-      HoodieRecordPayload combinedValue = hoodieRecord.getData().preCombine(oldValue);
-      // If combinedValue is oldValue, no need rePut oldRecord
-      if (combinedValue != oldValue) {
+      HoodieRecord storeRecord = records.get(key);
+      HoodieRecordPayload storeValue = storeRecord.getData();
+      HoodieRecordPayload combinedValue;
+      // If revertLogFile = false, storeRecord is the old record.
+      // If revertLogFile = true, incoming data (hoodieRecord) is the old record.
+      if (!revertLogFile) {
```

Review Comment: Oh, I see we have put in a fix here. Sounds good. But does the below one hold good?

```
delta commit1: insert rec1: val1. preCombine: 2
delta commit2: delete rec1:
delta commit2: insert rec1: val2. preCombine: 1
```

As per master, I guess the final snapshot will return val2 for rec1, and not the deleted one. Can you tell me what will happen with this patch, wherein we reverse the ordering?
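The question above boils down to merge-order semantics: if log files are scanned newest-first, the merge rule has to flip from "incoming record wins" to "first-seen record wins", otherwise the oldest record survives. The toy model below is a hypothetical simplification (not the real `HoodieMergedLogRecordScanner`, and deliberately ignoring preCombine ordering values); it only illustrates that the scan direction and the merge rule must change together:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Toy model: entries for one record key, listed oldest-first; the newest
// entry (possibly a delete) should determine the final snapshot value.
public class MergeOrderSketch {

  public static final class Entry {
    final String value;
    final boolean isDelete;
    public Entry(String value, boolean isDelete) { this.value = value; this.isDelete = isDelete; }
  }

  // Oldest-to-newest scan: each incoming entry overwrites the current one.
  public static Optional<String> scanForward(List<Entry> log) {
    Entry current = null;
    for (Entry e : log) {
      current = e; // incoming wins
    }
    return (current == null || current.isDelete) ? Optional.empty() : Optional.of(current.value);
  }

  // Newest-to-oldest scan: the FIRST entry seen must win instead.
  public static Optional<String> scanReversed(List<Entry> log) {
    List<Entry> rev = new ArrayList<>(log);
    Collections.reverse(rev);
    Entry current = null;
    for (Entry e : rev) {
      if (current == null) {
        current = e; // first seen wins; later (older) entries are ignored
      }
    }
    return (current == null || current.isDelete) ? Optional.empty() : Optional.of(current.value);
  }
}
```

With the thread's example (insert val1, delete, insert val2), both directions agree on val2 only because the merge rule was flipped along with the order; reversing the scan while keeping "incoming wins" would incorrectly resurrect val1.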
[GitHub] [hudi] hudi-bot commented on pull request #5091: [HUDI-3453] Fix HoodieBackedTableMetadata concurrent reading issue
hudi-bot commented on PR #5091: URL: https://github.com/apache/hudi/pull/5091#issuecomment-1240147859 ## CI report: * c0dc922eec0ffe4c93f250dcf91dd313713057db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11221) * c711e86c12cc97e9bb28afefe1de0334a07d840a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11236)
[jira] [Created] (HUDI-4807) Use correct instant in metadata initialization
Yuwei Xiao created HUDI-4807: Summary: Use correct instant in metadata initialization Key: HUDI-4807 URL: https://issues.apache.org/jira/browse/HUDI-4807 Project: Apache Hudi Issue Type: Bug Reporter: Yuwei Xiao
[GitHub] [hudi] hudi-bot commented on pull request #5091: [HUDI-3453] Fix HoodieBackedTableMetadata concurrent reading issue
hudi-bot commented on PR #5091: URL: https://github.com/apache/hudi/pull/5091#issuecomment-1240145111 ## CI report: * c0dc922eec0ffe4c93f250dcf91dd313713057db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11221) * c711e86c12cc97e9bb28afefe1de0334a07d840a UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve
nsivabalan commented on code in PR #5030: URL: https://github.com/apache/hudi/pull/5030#discussion_r965446760 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java: ## @@ -280,8 +281,11 @@ HoodieCompactionPlan generateCompactionPlan( .getLatestFileSlices(partitionPath) .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId())) .map(s -> { + // In most business scenarios, the latest data is in the latest delta log file, so we sort it from large + // to small according to the instant time, which can largely avoid rewriting the data in the + // compaction process, and then optimize the compaction time List logFiles = - s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator()).collect(toList()); + s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator().reversed()).collect(toList()); Review Comment: We might have to consider a few cases before we can flip the ordering here. Case 1: when OverwriteWithLatestAvro is used, if preCombine matches, we pick the latest. For example: delta commit 1: insert rec1: val1, preCombine: 1; delta commit 2: update rec1: val2, preCombine: 1; delta commit 3: insert rec1: val3, preCombine: 1. If we merge as usual (master), the final value of rec1 should be val3, but if we reverse, it could result in val1. Case 2: some payload implementations take values from the old record and combine with newer ones; in other words, they may not be commutative. For example, rec1.combineAndGetUpdate(rec2) != rec2.combineAndGetUpdate(rec1), or preCombine() for that matter. I really like the intent behind this patch, but I'm not sure it's as easy as flipping the order of log file merging.
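The order-dependence described in that review can be reproduced with a small self-contained sketch. Note this is a hypothetical stand-in: the `Rec` type and the latest-wins merge function below only mimic the preCombine/`OverwriteWithLatestAvroPayload` behavior; they are not Hudi's actual classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;

public class MergeOrderDemo {
    // Hypothetical record: key, value, and a preCombine ordering field.
    static final class Rec {
        final String key, val;
        final long orderingVal;
        Rec(String key, String val, long orderingVal) {
            this.key = key; this.val = val; this.orderingVal = orderingVal;
        }
    }

    // Mimics latest-wins semantics: on equal preCombine values, the record
    // seen LATER in the merge order wins -- which makes the result depend
    // on the order the log files are replayed in.
    static final BinaryOperator<Rec> OVERWRITE_WITH_LATEST =
        (older, newer) -> newer.orderingVal >= older.orderingVal ? newer : older;

    static Rec fold(List<Rec> logOrder) {
        return logOrder.stream().reduce(OVERWRITE_WITH_LATEST).orElseThrow();
    }

    public static void main(String[] args) {
        List<Rec> commits = List.of(
            new Rec("rec1", "val1", 1),   // delta commit 1
            new Rec("rec1", "val2", 1),   // delta commit 2
            new Rec("rec1", "val3", 1));  // delta commit 3
        List<Rec> reversed = new ArrayList<>(commits);
        Collections.reverse(reversed);
        System.out.println("forward=" + fold(commits).val
            + ", reversed=" + fold(reversed).val);
    }
}
```

Under commit-time order the last writer wins (val3); under the reversed order the oldest record resurfaces (val1), which is exactly the correctness concern with flipping the comparator.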
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965446536 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ## @@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception { } } } + + /** + * Determine whether to throw an exception when local view of table's timeline is behind that of client's view. + */ + private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) { Review Comment: Sorry, if it is called in `isLocalViewBehind` there is a timeline, but in the `handle` method I don't see where there is a timeline, and the original code instantiates the timeline in `errMsg`.
[GitHub] [hudi] hudi-bot commented on pull request #6615: [HUDI-4758] Add validations to java spark examples
hudi-bot commented on PR #6615: URL: https://github.com/apache/hudi/pull/6615#issuecomment-1240143195 ## CI report: * 3b37307093cf2c6eb20a4e5f738f8bac38f1dba7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11230) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] dongkelun commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240139114 > Also, a good practice to follow: whenever you are addressing feedback, try to add it as new commits. It's easier for the reviewer to re-review just the new changes; otherwise I have to review the entire patch again. If you add new commits, I can click on the newer commits and review only the newly changed code. OK, sorry, I thought it was fine to click "compare" after a force push; I mistakenly thought that force pushing would look cleaner. I didn't know you reviewed by comparing two commits. I'll pay attention in the future.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
nsivabalan commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965438383 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java: ## @@ -130,6 +140,13 @@ public Timer.Context getIndexCtx() { return indexTimer == null ? null : indexTimer.time(); } + public Timer.Context getConflictResolutionCtx() { +if (config.isMetricsOn() && conflictResolutionTimer == null) { Review Comment: Shouldn't we check for LockMetricsOn() as well? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -64,13 +69,18 @@ public void lock() { boolean acquired = false; while (retryCount <= maxRetries) { try { + metrics.startLockApiTimerContext(); acquired = lockProvider.tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS); if (acquired) { +metrics.updateLockAcquiredMetric(); +metrics.startLockHeldTimerContext(); Review Comment: Can we combine both of these into a single method? We don't have any caller that calls either of these individually, right? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java: ## @@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig { .sinceVersion("0.7.0") .withDocumentation(""); + public static final ConfigProperty LOCK_METRICS_ENABLE = ConfigProperty + .key(METRIC_PREFIX + ".lock.enable") + .defaultValue(false) Review Comment: Actually, we can add an infer function: if not explicitly set by the user, we can fetch the value of hoodie.metrics.enable, and we may not need to set a default value here.
Example of an infer function in HoodieMetricsConfig:
```
public static final ConfigProperty METRICS_REPORTER_PREFIX = ConfigProperty
    .key(METRIC_PREFIX + ".reporter.metricsname.prefix")
    .defaultValue("")
    .sinceVersion("0.11.0")
    .withInferFunction(cfg -> {
      if (cfg.contains(HoodieTableConfig.NAME)) {
        return Option.of(cfg.getString(HoodieTableConfig.NAME));
      }
      return Option.empty();
    })
    .withDocumentation("The prefix given to the metrics names.");
```
## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java: ## @@ -460,8 +461,19 @@ protected void preCommit(HoodieInstant inflightInstant, HoodieCommitMetadata met // Create a Hoodie table after startTxn which encapsulated the commits and files visible. // Important to create this after the lock to ensure the latest commits show up in the timeline without need for reload HoodieTable table = createTable(config, hadoopConf); -TransactionUtils.resolveWriteConflictIfAny(table, this.txnManager.getCurrentTransactionOwner(), -Option.of(metadata), config, txnManager.getLastCompletedTransactionOwner(), false, this.pendingInflightAndRequestedInstants); +Timer.Context indexTimer = metrics.getConflictResolutionCtx(); Review Comment: Minor: `indexTimer` -> `conflictResolutionTimer`
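The "combine both calls into a single method" suggestion can be sketched as below. This is a hypothetical, simplified stand-in for the PR's lock metrics class (plain counters and nanosecond timestamps instead of the real Dropwizard timers):

```java
public class LockMetricsSketch {
    private long lockAcquiredCount = 0;
    private long lockHeldStartNanos = -1;

    // Single entry point replacing the back-to-back calls
    // updateLockAcquiredMetric() + startLockHeldTimerContext(),
    // so neither can be invoked without the other.
    public void onLockAcquired() {
        lockAcquiredCount++;                    // count successful acquisitions
        lockHeldStartNanos = System.nanoTime(); // start the lock-held timer
    }

    // Returns how long the lock was held, in nanoseconds.
    public long onLockReleased() {
        return System.nanoTime() - lockHeldStartNanos;
    }

    public long acquiredCount() {
        return lockAcquiredCount;
    }
}
```

Collapsing the two calls into one method makes the invariant "every acquisition starts a held-timer" impossible to violate at call sites, which is the point of the review comment.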
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
nsivabalan commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965436152 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ## @@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception { } } } + + /** + * Determine whether to throw an exception when local view of table's timeline is behind that of client's view. + */ + private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) { Review Comment: Shouldn't we call this from within isLocalViewBehind()? We already have the timeline there, right? We don't need to re-instantiate the timeline again in L507.
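The refactor being suggested — passing the already-built timeline into the check instead of rebuilding it at the call site — might look roughly like this. All names here are simplified, hypothetical stand-ins for the `RequestHandler` methods under discussion, with a `List<String>` of instant timestamps standing in for `HoodieTimeline`:

```java
import java.util.List;

public class TimelineCheckSketch {

    static boolean isLocalViewBehind(List<String> localTimeline, String clientLastInstantTs) {
        boolean behind = localTimeline.isEmpty()
            || localTimeline.get(localTimeline.size() - 1).compareTo(clientLastInstantTs) < 0;
        // Reuse the SAME timeline object for the follow-up decision instead of
        // re-instantiating it at the call site (the duplication being flagged).
        if (behind && shouldThrowExceptionIfLocalViewBehind(localTimeline, clientLastInstantTs)) {
            throw new IllegalStateException(
                "Local view of timeline is behind client view at " + clientLastInstantTs);
        }
        return behind;
    }

    static boolean shouldThrowExceptionIfLocalViewBehind(List<String> timeline, String lastInstantTs) {
        // Example policy only: throw if the client's instant is entirely unknown locally.
        return !timeline.contains(lastInstantTs);
    }
}
```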
[GitHub] [hudi] nsivabalan commented on pull request #6536: [HUDI-4736] Fix inflight clean action preventing clean service to continue when multiple cleans are not allowed
nsivabalan commented on PR #6536: URL: https://github.com/apache/hudi/pull/6536#issuecomment-1240127060 @yihua : can you check the CI failure? Please file a tracking JIRA for enhancing tests. Once CI succeeds, you can go ahead and land it.
[GitHub] [hudi] nsivabalan commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
nsivabalan commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240126607 Also, a good practice to follow: whenever you are addressing feedback, try to add it as new commits. It's easier for the reviewer to re-review just the new changes; otherwise I have to review the entire patch again. If you add new commits, I can click on the newer commits and review only the newly changed code.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6031: [HUDI-4282] Repair IOException in some other dfs, except hdfs,when check block corrupted in HoodieLogFileReader
nsivabalan commented on code in PR #6031: URL: https://github.com/apache/hudi/pull/6031#discussion_r965431934 ## hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java: ## @@ -632,6 +635,15 @@ public static boolean isGCSFileSystem(FileSystem fs) { return fs.getScheme().equals(StorageSchemes.GCS.getScheme()); } + /** + * Some filesystem(such as chdfs) will throw {@code IOException} instead of {@code EOFException}. It will cause error in isBlockCorrupted(). + * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired offset is out of the file size in advance. + */ + public static boolean shouldWrappedByBoundedDataStream(FileSystem fs) { Review Comment: Can we keep it simple for now?
```
public static boolean isCHDFSFileSystem(FileSystem fs) {
  return fs.getScheme().equals(StorageSchemes.CHDFS.getScheme());
}
```
If at all we come across other storage schemes which might need this, we can make it a map.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240123350 Thanks, today we'll test according to the new patch. If there's any news, we'll sync it with you again.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6031: [HUDI-4282] Repair IOException in some other dfs, except hdfs,when check block corrupted in HoodieLogFileReader
nsivabalan commented on code in PR #6031: URL: https://github.com/apache/hudi/pull/6031#discussion_r965431369 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java: ## @@ -516,4 +521,23 @@ private static FSDataInputStream getFSDataInputStreamForGCS(FSDataInputStream fs return fsDataInputStream; } + + /** + * Some filesystem(such as chdfs) will throw {@code IOException} instead of {@code EOFException}. It will cause error in isBlockCorrupted(). + * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired offset is out of the file size in advance. + */ + private static FSDataInputStream wrapStreamByBoundedFsDataInputStream(FileSystem fs, Review Comment: If we call this method at line 490 above, we don't need lines 533 to 539, right? Essentially line 493 could be
```
return FSUtils.shouldWrappedByBoundedDataStream(fs)
    ? new BoundedFsDataInputStream(fs, logFile.getPath(), fsDataInputStream)
    : fsDataInputStream;
```
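The bounds-check idea behind `BoundedFsDataInputStream` can be sketched independently of Hadoop. This is a hypothetical minimal wrapper (the real class wraps an `FSDataInputStream` and obtains the length from the filesystem); it only illustrates normalizing the error to `EOFException`:

```java
import java.io.EOFException;
import java.io.IOException;

public class BoundedSeekSketch {
    private final long fileLength;
    private long pos = 0;

    public BoundedSeekSketch(long fileLength) {
        this.fileLength = fileLength;
    }

    // Check the requested offset against the known file length BEFORE seeking,
    // and raise EOFException ourselves. Filesystems such as chdfs would otherwise
    // surface a plain IOException, which isBlockCorrupted() does not recognize
    // as "seeked past end of file".
    public void seek(long offset) throws IOException {
        if (offset > fileLength) {
            throw new EOFException(
                "Attempted to seek past EOF: " + offset + " > " + fileLength);
        }
        pos = offset;
    }

    public long getPos() {
        return pos;
    }
}
```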
[GitHub] [hudi] santoshsb opened a new issue, #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
santoshsb opened a new issue, #5452: URL: https://github.com/apache/hudi/issues/5452 Hi Team, We are currently evaluating Hudi for our analytical use cases, and as part of this exercise we are facing a few issues with schema evolution and data loss. The current issue we have encountered is while updating a record. We have currently inserted a single record with the following schema:
```
root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
Now when we insert the new data with the following schema:
```
root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
the update is successful, but the schema is missing the
```
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
```
field. Our expected behaviour was that after adding the second entry, the new column "multipleBirthBoolean" would be added to the overall schema, and the previous "maritalStatus" struct column would be retained and be null for the second entry. The final schema looks like this:
```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
Basically, when a new entry is added and it is missing a column from the destination schema, the update is successful and the missing column vanishes from the previous entries. Let us know if we are missing any configuration options. We cannot control the schema, as it is defined by FHIR standards (https://www.hl7.org/fhir/patient.html#resource); most of the fields here are optional, so the incoming data from our customers will be missing certain columns.

**Environment Description**

* Hudi version : 0.12.0-SNAPSHOT
* Spark version : 3.2.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : Local
* Running on Docker? (yes/no) : no

Thanks for the help.
[GitHub] [hudi] xiarixiaoyao closed issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
xiarixiaoyao closed issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert. URL: https://github.com/apache/hudi/issues/5452
[GitHub] [hudi] xiarixiaoyao commented on issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
xiarixiaoyao commented on issue #5452: URL: https://github.com/apache/hudi/issues/5452#issuecomment-1240122505 @santoshsb you need to use schema evolution and hoodie.datasource.write.reconcile.schema; see the following code:
```
def perf(spark: SparkSession) = {
  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.functions._
  import org.apache.hudi.DataSourceWriteOptions
  import org.apache.hudi.DataSourceReadOptions
  import org.apache.hudi.config.HoodieWriteConfig
  import org.apache.hudi.hive.MultiPartKeysValueExtractor

  // Define a Patient FHIR resource; for simplicity most elements are deleted and a few retained
  val orgString = """{"resourceType":"Patient","id":"beca9a29-49bb-40e4-adff-4dbb4d664972","lastUpdated":"2022-02-14T15:18:18.90836+05:30","source":"4a0701fe-5c3b-482b-895d-875fcbd2148a","name":[{"use":"official","family":"Keeling57","given":["Serina556"],"prefix":["Ms."]}]}"""
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  val orgStringDf = spark.read.json(Seq(orgString).toDS)

  // Specify common DataSourceWriteOptions in the single hudiOptions variable
  val hudiOptions = Map[String, String](
    HoodieWriteConfig.TABLE_NAME -> "patient_hudi",
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "source",
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "lastUpdated",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true")

  // Write the orgStringDf to a Hudi table
  orgStringDf.write
    .format("org.apache.hudi")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
    .options(hudiOptions)
    .mode(SaveMode.Overwrite)
    .save("/work/data/updateTst/hudi/json_schema_tst")

  // Read the Hudi table
  val patienthudi = spark.read.format("hudi").load("/work/data/updateTst/hudi/json_schema_tst")

  // Print schema
  patienthudi.printSchema

  // Update: based on our use case, add a new patient resource; this resource might contain
  // new columns and might not have existing columns (a normal use case with FHIR data)
  val updatedString = """{"resourceType":"Patient","id":"beca9a29-49bb-40e4-adff-4dbb4d664972","lastUpdated":"2022-02-14T15:18:18.90836+05:30","source":"4a0701fe-5c3b-482b-895d-875fcbd2148a","name":[{"use":"official","family":"Keeling57","given":["Serina556"]}]}"""

  // Convert the new resource string into a DF
  val updatedStringDf = spark.read.json(Seq(updatedString).toDS)

  // Check the schema of the new resource that is being added
  updatedStringDf.printSchema

  // Upsert the new resource
  spark.sql("set hoodie.schema.on.read.enable=true")
  updatedStringDf.write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .mode(SaveMode.Append)
    .save("/work/data/updateTst/hudi/json_schema_tst")

  // Read the Hudi table
  val patienthudiUpdated = spark.read.format("hudi").load("/work/data/updateTst/hudi/json_schema_tst")

  // Print the schema after adding the new record
  patienthudiUpdated.printSchema
}
```
patienthudiUpdated.schema:
```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- family: string (nullable = true)
 |    |    |-- given: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- prefix: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- use: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
I think it should be ok, thanks.
[GitHub] [hudi] xushiyan commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
xushiyan commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r965411452 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -399,9 +451,65 @@ protected void writeIncomingRecords() throws IOException { } } + protected SerializableRecord cdcRecord(HoodieCDCOperation operation, String recordKey, String partitionPath, + GenericRecord oldRecord, GenericRecord newRecord) { +GenericData.Record record; +if (cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE_AFTER)) { + record = CDCUtils.cdcRecord(operation.getValue(), instantTime, Review Comment: can we prefix classes with `Hoodie`? like `HoodieCDCUtils` , which is the convention in the codebase ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -399,9 +451,65 @@ protected void writeIncomingRecords() throws IOException { } } + protected SerializableRecord cdcRecord(HoodieCDCOperation operation, String recordKey, String partitionPath, Review Comment: better name for this kind of method would be starting with `make` or `create`, easier to understand ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/CDCFileSplit.java: ## @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.cdc; + +import org.apache.hudi.common.model.FileSlice; +import org.apache.hudi.common.util.Option; + +import java.io.Serializable; + +/** + * This contains all the information that retrieve the change data at a single file group and + * at a single commit. + * + * For [[cdcFileType]] = [[CDCFileTypeEnum.ADD_BASE_FILE]], [[cdcFile]] is a current version of + * the base file in the group, and [[beforeFileSlice]] is None. + * For [[cdcFileType]] = [[CDCFileTypeEnum.REMOVE_BASE_FILE]], [[cdcFile]] is null, + * [[beforeFileSlice]] is the previous version of the base file in the group. + * For [[cdcFileType]] = [[CDCFileTypeEnum.CDC_LOG_FILE]], [[cdcFile]] is a log file with cdc blocks. + * when enable the supplemental logging, both [[beforeFileSlice]] and [[afterFileSlice]] are None, + * otherwise these two are the previous and current version of the base file. + * For [[cdcFileType]] = [[CDCFileTypeEnum.MOR_LOG_FILE]], [[cdcFile]] is a normal log file and + * [[beforeFileSlice]] is the previous version of the file slice. + * For [[cdcFileType]] = [[CDCFileTypeEnum.REPLACED_FILE_GROUP]], [[cdcFile]] is null, + * [[beforeFileSlice]] is the current version of the file slice. + */ +public class CDCFileSplit implements Serializable { Review Comment: HoodieCDCFileSplit ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieCDCDataBlock.java: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.log.block; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FSDataInputStream; + +import org.apache.hudi.common.util.Option; + +import javax.annotation.Nonnull; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public class HoodieCDCDataBlock extends HoodieAvroDataBlock { + + public HoodieCDCDataBlock( +
[GitHub] [hudi] nsivabalan commented on pull request #5406: [HUDI-3954] Don't keep the last commit before the earliest commit to retain
nsivabalan commented on PR #5406: URL: https://github.com/apache/hudi/pull/5406#issuecomment-1240120146 Hey @danny0405: maybe there is some rationale behind the original intent. It's just deducting 1 commit from what the user wants, right? As of now, I don't feel this is giving us much or fixing any regression. Can we drop the patch?
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965426924 ## rfc/rfc-43/rfc-43.md: ## @@ -0,0 +1,316 @@ + + +# RFC-43: Implement Table Management ServiceTable Management Service for Hudi + +## Proposers + +- @yuzhaojing + +## Approvers + +- @vinothchandar +- @Raymond + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016) + +## Abstract + +Hudi table needs table management operations. Currently, schedule these job provides Three ways: + +- Inline, execute these job and writing job in the same application, perform the these job and writing job serially. + +- Async, execute these job and writing job in the same application, Async parallel execution of these job and write job. + +- Independent compaction/clustering job, execute an async compaction/clustering job of another application. + +With the increase in the number of HUDI tables, due to a lack of management capabilities, maintenance costs will become +higher. This proposal is to implement an independent compaction/clustering Service to manage the Hudi +compaction/clustering job. + +## Background + +In the current implementation, if the HUDI table needs do compact/cluster, it only has three ways: + +1. Use inline compaction/clustering, in this mode the job will be block writing job. + +2. Using Async compaction/clustering, in this mode the job execute async but also sharing the resource with HUDI to + write a job that may affect the stability of job writing, which is not what the user wants to see. + +3. Using independent compaction/clustering job is a better way to schedule the job, in this mode the job execute async + and do not sharing resources with writing job, but also has some questions: +1. Users have to enable lock service providers so that there is not data loss. 
Especially when compaction/clustering + is getting scheduled, no other writes should proceed concurrently and hence a lock is required. +2. The user needs to manually start an async compaction/clustering application, which means that the user needs to + maintain two jobs. +3. With the increase in the number of HUDI jobs, there is no unified service to manage compaction/clustering jobs ( + monitor, retry, history, etc...), which will make maintenance costs increase. + +With this effort, we want to provide an independent compaction/clustering Service, it will have these abilities: + +- Provides a pluggable execution interface that can adapt to multiple execution engines, such as Spark and Flink. + +- With the ability to failover, need to be persisted compaction/clustering message. + +- Perfect metrics and reuse HoodieMetric expose to the outside. + +- Provide automatic failure retry for compaction/clustering job. + +## Implementation + +### Processing mode +Different processing modes depending on whether the meta server is enabled + +- Enable meta server +- The pull-based mechanism works for fewer tables. Scanning 1000s of tables for possible services is going to induce lots of a load of listing. +- The meta server provides a listener that takes as input the uris of the Table Management Service and triggers a callback through the hook at each instant commit, thereby calling the Table Management Service to do the scheduling/execution for the table. +![](service_with_meta_server.png) + +- Do not enable meta server +- for every write/commit on the table, the table management server is notified. 
+ We can set a heartbeat timeout for each hoodie table, and if it exceeds it, we will actively pull it once to prevent the commit request from being lost +![](service_without_meta_server.png) + +### Processing flow + +- After receiving the request, the table management server schedules the relevant table service to the table's timeline +- Persist each table service into an instance table of Table Management Service +- notify a separate execution component/thread can start executing it +- Monitor task execution status, update table information, and retry failed table services up to the maximum number of times + +### Storage + +- There are two types of stored information +- Register with the hoodie table of the Table Management Service +- Each table service instance is generated by Table Management Service + + Lectotype + +**Requirements:** support single row ACID transactions. Almost all write operations require it, like operation creation, +status changing and so on. + +There are the candidates, + +**Hudi table** + +pros: + +- No external components are introduced and maintained. + +crons: + +- Each write to hudi table will be a deltacommit, this will further lower the number of possible requests / sec that can + be served. + +**RDBMS** + +pros: + +- database that is suitable for structured data like metadata to store. + +- can describe the relation between many
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965424691 ## rfc/rfc-43/rfc-43.md: ## @@ -0,0 +1,316 @@

# RFC-43: Table Management Service for Hudi

## Proposers

- @yuzhaojing

## Approvers

- @vinothchandar
- @Raymond

## Status

JIRA: [https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016)

## Abstract

Hudi tables need table management operations. Currently, there are three ways to schedule these jobs:

- Inline: run the table service and the writing job in the same application, executing them serially.

- Async: run the table service and the writing job in the same application, executing them in parallel.

- Independent compaction/clustering job: run an async compaction/clustering job in a separate application.

As the number of Hudi tables grows, the lack of management capabilities pushes maintenance costs higher. This proposal implements an independent compaction/clustering service to manage Hudi compaction/clustering jobs.

## Background

In the current implementation, a Hudi table that needs compaction/clustering has only three options:

1. Inline compaction/clustering: the table service blocks the writing job.

2. Async compaction/clustering: the table service executes asynchronously but shares resources with the writing job, which may affect write stability; this is not what users want.

3. An independent compaction/clustering job: the better way to schedule the service, since it executes asynchronously and does not share resources with the writing job, but it still has some problems:
   1. Users have to enable a lock provider so that there is no data loss. Especially while compaction/clustering is being scheduled, no other writes should proceed concurrently, hence a lock is required.
   2. Users must manually start a separate async compaction/clustering application, which means maintaining two jobs.
   3. As the number of Hudi jobs grows, there is no unified service to manage compaction/clustering jobs (monitoring, retries, history, etc.), so maintenance costs keep increasing.

With this effort, we want to provide an independent compaction/clustering service with these abilities:

- Provide a pluggable execution interface that can adapt to multiple execution engines, such as Spark and Flink.

- Support failover, which requires persisting compaction/clustering messages.

- Expose thorough metrics to the outside, reusing HoodieMetrics.

- Automatically retry failed compaction/clustering jobs.

## Implementation

### Processing mode

The processing mode differs depending on whether the meta server is enabled.

- Meta server enabled
  - A pull-based mechanism only works for a small number of tables; scanning thousands of tables for runnable services induces a heavy listing load.
  - The meta server therefore provides a listener that takes the URIs of the Table Management Service as input and triggers a callback through a hook at each instant commit, calling the Table Management Service to do the scheduling/execution for the table.

![](service_with_meta_server.png)

- Meta server not enabled
  - For every write/commit on the table, the table management server is notified. We can also set a heartbeat timeout per Hudi table; if it is exceeded, we actively pull once so that a lost commit request does not go unnoticed.

![](service_without_meta_server.png)

### Processing flow

- After receiving the request, the table management server schedules the relevant table service onto the table's timeline.
- Each table service is persisted into an instance table of the Table Management Service.
- A separate execution component/thread is notified that it can start executing.
- Task execution status is monitored, table information is updated, and failed table services are retried up to the maximum number of times.

### Storage

Two types of information are stored:

- Registrations of Hudi tables with the Table Management Service.
- The table service instances generated by the Table Management Service.

#### Selection

**Requirements:** support single-row ACID transactions. Almost all write operations require this, such as operation creation and status changes.

The candidates are:

**Hudi table**

Pros:

- No external components are introduced or maintained.

Cons:

- Each write to the Hudi table is a deltacommit, which further lowers the number of requests/sec that can be served.

**RDBMS**

Pros:

- A database is well suited to storing structured data such as this metadata.

- Can describe the relation between many
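The pluggable execution interface called for in the RFC (adapting to multiple engines such as Spark and Flink) could be sketched roughly as follows. All names here (`TableServiceExecutor`, `TableManagementServiceSketch`, the paths and instant times) are illustrative assumptions, not actual Hudi APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-method interface an engine implements to run one
// scheduled table service (compaction/clustering) instant.
interface TableServiceExecutor {
    boolean execute(String tableBasePath, String instantTime);
}

public class TableManagementServiceSketch {
    private final Map<String, TableServiceExecutor> executors = new HashMap<>();

    // Engines (Spark, Flink, ...) register an executor implementation.
    void register(String engine, TableServiceExecutor executor) {
        executors.put(engine, executor);
    }

    // The service dispatches a scheduled instant to the matching engine.
    boolean dispatch(String engine, String tableBasePath, String instantTime) {
        TableServiceExecutor executor = executors.get(engine);
        if (executor == null) {
            throw new IllegalArgumentException("No executor registered for engine: " + engine);
        }
        return executor.execute(tableBasePath, instantTime);
    }

    public static void main(String[] args) {
        TableManagementServiceSketch service = new TableManagementServiceSketch();
        // A toy "flink" executor that always succeeds, standing in for a real job launcher.
        service.register("flink", (table, instant) -> true);
        System.out.println(service.dispatch("flink", "s3://bucket/tbl", "20220908120000")); // true
    }
}
```

Because the interface has a single method, each engine adapter can be supplied as a lambda or as a full class wrapping its job-submission client.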
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965424454 ## rfc/rfc-43/rfc-43.md
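The meta-server hook discussed in this review thread (a listener registered per Table Management Service URI, fired at each instant commit) could be sketched in-process like this. Everything here is an illustrative assumption, including the class names and the URI; a real implementation would push over RPC/HTTP rather than call a local lambda:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the meta-server commit hook; not actual Hudi classes.
public class CommitHookSketch {
    private final List<Consumer<String>> commitListeners = new ArrayList<>();

    // Register a callback for a Table Management Service; a real meta server
    // would remember tmsUri and deliver notifications over the network.
    void registerListener(String tmsUri, Consumer<String> onCommit) {
        commitListeners.add(onCommit);
    }

    // Called after an instant commits on the timeline: notify every listener
    // so the TMS can decide whether to schedule compaction/clustering.
    int completeInstant(String instantTime) {
        for (Consumer<String> listener : commitListeners) {
            listener.accept(instantTime);
        }
        return commitListeners.size(); // number of TMS callbacks fired
    }

    public static void main(String[] args) {
        CommitHookSketch metaServer = new CommitHookSketch();
        List<String> seen = new ArrayList<>();
        metaServer.registerListener("http://tms.example:26755", seen::add);
        metaServer.completeInstant("20220908093000");
        System.out.println(seen); // [20220908093000]
    }
}
```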
[GitHub] [hudi] nsivabalan commented on issue #6552: [SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0
nsivabalan commented on issue #6552: URL: https://github.com/apache/hudi/issues/6552#issuecomment-1240113529 yeah. Udit pointed out the right commit. here is the fix that worked out for me locally.

```
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java b/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
index 20a20fb629..a3c6dde99e 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
@@ -69,21 +69,21 @@ public class AWSDmsAvroPayload extends OverwriteWithLatestAvroPayload {

   @Override
   public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema, properties).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema, properties);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }

   @Override
   public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }

   @Override
   public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties)
       throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema, properties).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema, properties);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
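The shape of that fix — guard the `Option` before post-processing instead of calling `.get()` unconditionally — can be sketched with `java.util.Optional` standing in for Hudi's `Option`. The names below are illustrative, not the actual payload class:

```java
import java.util.Optional;

public class DeleteGuardSketch {
    // Stand-in for super.getInsertValue(...): an empty Optional models a record
    // the base payload already treats as a delete (no value to return).
    static Optional<String> baseInsertValue(boolean deleted) {
        return deleted ? Optional.empty() : Optional.of("record");
    }

    // Mirrors the patched methods: only run handleDeleteOperation-style
    // post-processing when a value is actually present. The buggy version
    // called .get() first, which throws NoSuchElementException on deletes.
    static Optional<String> getInsertValue(boolean deleted) {
        Optional<String> insertValue = baseInsertValue(deleted);
        return insertValue.isPresent()
            ? insertValue.map(v -> v + ":checked") // stand-in for handleDeleteOperation
            : insertValue;
    }

    public static void main(String[] args) {
        System.out.println(getInsertValue(false).orElse("<deleted>")); // record:checked
        System.out.println(getInsertValue(true).orElse("<deleted>"));  // <deleted>
    }
}
```

The key point is that the empty case is propagated as-is rather than dereferenced, which is exactly what the three patched methods above now do.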
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965423222 ## rfc/rfc-43/rfc-43.md (on the "Lectotype" heading in the Storage section) Review Comment: Will update it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
danny0405 commented on code in PR #6628: URL: https://github.com/apache/hudi/pull/6628#discussion_r965422148 ## packaging/hudi-flink-bundle/pom.xml:

@@ -501,8 +501,7 @@
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
-
-      <version>1.10.0</version>
+      <version>${avro.version}</version>

Review Comment: Make sure that there are no compatibility issues in flink side.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
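For context, the change relies on the standard Maven convention of defining a version once as a property in the root pom and interpolating it in child modules. The fragment below is a hedged sketch of that convention with an illustrative version number, not the exact Hudi pom contents:

```xml
<!-- Root pom (sketch): declare the Avro version once as a property. -->
<properties>
  <avro.version>1.10.0</avro.version> <!-- illustrative value -->
</properties>

<!-- Any module or bundle pom: reference the property instead of hard-coding. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
```

This keeps every bundle on the same Avro version and makes an upgrade a one-line change in the root pom.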
[GitHub] [hudi] hudi-bot commented on pull request #6548: [HUDI-4749] Fixing full cleaning to leverage metadata table
hudi-bot commented on PR #6548: URL: https://github.com/apache/hudi/pull/6548#issuecomment-1240110458 ## CI report: * 78d0f8bb6487e55b91443dcade5285e4a2412e3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11229) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-4766: - Fix Version/s: 0.12.1

> Fix HoodieFlinkClusteringJob
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.12.1
>
> h1. Flink Hudi Clustering Issues
>
> # Integer type used for byte-size configuration parameters instead of long
> ** Maximum representable size is 2^31-1 bytes (~2 gigabytes)
> # Unable to choose a particular instant to execute
> # Unable to select the filter mode, as the method that controls it is overridden by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
> # No cleaning
> ** With reference to offline compaction (HoodieFlinkCompactor), cleaning is only enabled if _clean.async.enabled = false_
> # Schedule configuration is not consistent with HoodieFlinkCompactor, defining the flag as false, which is the opposite of HoodieFlinkCompactor
> # No ability to pass props in using _--props/--hoodie-conf_
> ** Required for passing in configurations like:
> *** _hoodie.parquet.compression.ratio_
> *** Partition filter configurations depending on the strategy
> # Clustering groups will spit out files of _hoodie.parquet.max.file.size_ (120MB by default)
> # Multiple clustering jobs can execute, but there is no fine-grained control over restarting jobs that have failed. The current implementation only filters for REQUESTED clustering jobs; rollbacks will never be performed.
> # Removed unused _getNumberOfOutputFileGroups()_ function.
> ** _hoodie.clustering.plan.strategy.small.file.limit_
> ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
> ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
> ** Will create N file groups (1 task will be writing to each file group, increasing parallelism)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
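The first issue in HUDI-4766 (int-typed byte-size configuration parameters) is easy to demonstrate: any size of 2^31 bytes or more cannot be represented as an `int`, and a narrowing cast silently corrupts it. A minimal illustration:

```java
public class ByteSizeOverflow {
    public static void main(String[] args) {
        // An int caps byte-size configs at 2^31 - 1 bytes, i.e. just under 2 GiB.
        long fourGiB = 4L * 1024 * 1024 * 1024;          // 4294967296 bytes
        System.out.println(fourGiB > Integer.MAX_VALUE); // true: out of int range
        // Narrowing to int keeps only the low 32 bits: 2^32 truncates to 0.
        System.out.println((int) fourGiB);               // 0
        // A long-typed option represents the same size exactly.
        System.out.println(fourGiB);                     // 4294967296
    }
}
```

This is why byte-size options such as file-size limits need to be declared as `long` rather than `int`.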
[jira] [Resolved] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen resolved HUDI-4766. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601571#comment-17601571 ] Danny Chen commented on HUDI-4766: Fixed via master branch: adf36093d2454c7e3cd7090a0cb3fd5af140b919 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965421349 ## rfc/rfc-43/rfc-43.md (on the Processing flow line: "After receiving the request, the table management server schedules the relevant table service to the table's timeline") Review Comment: I mean scheduling the corresponding table service to the hudi table's timeline on storage via TMS -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
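The RFC's processing flow ends with monitoring execution status and retrying failed table services up to a maximum count. That retry step could be sketched like this; the names are hypothetical, not Hudi code:

```java
import java.util.concurrent.Callable;

public class TableServiceRetry {
    // Runs one table service, retrying failures up to maxAttempts times,
    // mirroring the RFC's "retry failed table services up to the maximum
    // number of times" step. Names are illustrative, not actual Hudi APIs.
    static boolean runWithRetries(Callable<Boolean> service, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (service.call()) {
                    return true;             // success: stop retrying
                }
            } catch (Exception e) {
                // A real service would log, update instance state, and back off here.
            }
        }
        return false;                        // exhausted attempts: mark FAILED
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Toy service that fails twice, then succeeds on the third attempt.
        boolean ok = runWithRetries(() -> ++calls[0] >= 3, 5);
        System.out.println(ok + " after " + calls[0] + " attempts"); // true after 3 attempts
    }
}
```

In the proposed service, the failure terminal state (after exhausting attempts) is what gets surfaced through metrics and history, rather than being silently dropped.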
[hudi] branch master updated (e8aee84c7c -> adf36093d2)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from e8aee84c7c [HUDI-4793] Fixing ScalaTest tests to properly respect Log4j2 configs (#6617)
 add adf36093d2 [HUDI-4766] Strengthen flink clustering job (#6566)

No new revisions were added by this update.

Summary of changes:
 .../FlinkSizeBasedClusteringPlanStrategy.java      |  8 ---
 .../apache/hudi/configuration/FlinkOptions.java    | 12 ++---
 .../hudi/sink/clustering/ClusteringCommitSink.java |  7 +++
 .../hudi/sink/clustering/ClusteringOperator.java   |  5 ++
 .../sink/clustering/FlinkClusteringConfig.java     | 61 ++
 .../sink/clustering/HoodieFlinkClusteringJob.java  | 60 +
 .../hudi/sink/compact/HoodieFlinkCompactor.java    | 16 ++
 .../java/org/apache/hudi/util/StreamerUtil.java    |  4 +-
 8 files changed, 136 insertions(+), 37 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #6566: [HUDI-4766] Strengthen flink clustering job
danny0405 merged PR #6566: URL: https://github.com/apache/hudi/pull/6566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240107831 ## CI report: * 5f385a174df1fa344b87a3a4ada3f3f6d61f1d76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11225) * a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11233) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240104388 ## CI report: * 5f385a174df1fa344b87a3a4ada3f3f6d61f1d76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11225) * a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6575: [HUDI-4754] Add compliance check in github actions
hudi-bot commented on PR #6575: URL: https://github.com/apache/hudi/pull/6575#issuecomment-1240093961

## CI report:

* 1600e31836157c8d05e3bc8b9e08e1717471f1a6 UNKNOWN
* 4d02f2c64a5fc4b89889677ee639a20b53cec26a UNKNOWN
* 48147d19c835e7868102fd2d083659e6ee2ac343 UNKNOWN
* c644d730766bcb7c00c7b427d73e56a1c63dbb3a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11228)
[GitHub] [hudi] xushiyan commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
xushiyan commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r956951504

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java:

```diff
@@ -116,12 +125,18 @@ public List close() {
       String key = newRecordKeysSorted.poll();
       HoodieRecord hoodieRecord = keyToNewRecords.get(key);
       if (!writtenRecordKeys.contains(hoodieRecord.getRecordKey())) {
+        Option insertRecord;
         if (useWriterSchemaForCompaction) {
-          writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps()));
+          insertRecord = hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps());
         } else {
-          writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchema, config.getProps()));
+          insertRecord = hoodieRecord.getData().getInsertValue(tableSchema, config.getProps());
         }
+        writeRecord(hoodieRecord, insertRecord);
```

Review Comment: ditto
[jira] [Commented] (HUDI-4485) Hudi cli got empty result for command show fsview all
[ https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601562#comment-17601562 ]

Yao Zhang commented on HUDI-4485:
---------------------------------

Hi [~codope], finally all unit test issues have been resolved and CI passed. Could you please help review this PR? Thank you very much.

> Hudi cli got empty result for command show fsview all
> -----------------------------------------------------
>
>                 Key: HUDI-4485
>                 URL: https://issues.apache.org/jira/browse/HUDI-4485
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.11.1
>         Environment: Hudi version : 0.11.1
>                      Spark version : 3.1.1
>                      Hive version : 3.1.0
>                      Hadoop version : 3.1.1
>            Reporter: Yao Zhang
>            Assignee: Yao Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empty result for command show fsview all · Issue #6177 · apache/hudi (github.com)|https://github.com/apache/hudi/issues/6177]
>
> *Describe the problem you faced*
>
> Hudi cli returned an empty result after running the command `show fsview all`.
> (screenshot: https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)
> The type of table t1 is COW, and I am sure that the parquet files are actually generated inside the data folder. Also, the parquet files are not damaged, as the data could be retrieved correctly either by reading it as a Hudi table or by directly reading each parquet file (using Spark).
>
> *To Reproduce*
>
> Steps to reproduce the behavior:
> 1. Enter the Flink SQL client.
> 2. Execute the SQL below and check that the data was written successfully.
> ```sql
> CREATE TABLE t1(
>   uuid VARCHAR(20),
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///path/to/table/',
>   'table.type' = 'COPY_ON_WRITE'
> );
>
> -- insert data using values
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter Hudi cli and execute `show fsview all`.
>
> *Expected behavior*
>
> `show fsview all` in Hudi cli should return all file slices.
>
> *Environment Description*
>
> * Hudi version : 0.11.1
> * Spark version : 3.1.1
> * Hive version : 3.1.0
> * Hadoop version : 3.1.1
> * Storage (HDFS/S3/GCS..) : HDFS
> * Running on Docker? (yes/no) : no
>
> *Additional context*
>
> No.
>
> *Stacktrace*
>
> N/A
>
> Temporary solution:
> I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
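For readers following the reproduction above, step 3 can be sketched as the Hudi CLI session below. This is a sketch rather than a transcript from the issue: the launch script path and prompt text vary by packaging, and the table path is the same placeholder used in the Flink DDL.

```
# Launch the Hudi CLI (script location depends on how Hudi was built/packaged)
./hudi-cli/hudi-cli.sh

# Point the CLI at the table written by the Flink job, then list file slices
hudi-> connect --path hdfs:///path/to/table/
hudi:table-> show fsview all
```

When the bug reproduces, `show fsview all` prints an empty result even though parquet files exist under the partition directories; after the spring-shell replacement described in the issue, the command returns the expected file-slice listing.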
[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …
paul8263 commented on PR #6489: URL: https://github.com/apache/hudi/pull/6489#issuecomment-1240058067

> Hi @codope and @yihua, errors of hudi-integ-test are almost cleared. The only one left is:
>
> org.apache.hudi.integ.command.ITTestHoodieSyncCommand.testValidateSync(ITTestHoodieSyncCommand.java:56)
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11183=logs=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de=30b5aae4-0ea0-5566-42d0-febf71a7061a=146906
>
> Is there a way to view the detailed error log in the docker container via Azure?

Finally, all test failures have been resolved.
[GitHub] [hudi] hudi-bot commented on pull request #6520: [HUDI-4726] Incremental input splits result is not as expected when f…
hudi-bot commented on PR #6520: URL: https://github.com/apache/hudi/pull/6520#issuecomment-1240042837

## CI report:

* e55d28bdafa64d4a5180fd46191a420e702a58dc UNKNOWN
* 360821a2d0110a82ac3c56eb65bcc3ad9b9659bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11227)
[GitHub] [hudi] hudi-bot commented on pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
hudi-bot commented on PR #6628: URL: https://github.com/apache/hudi/pull/6628#issuecomment-1240039267

## CI report:

* 2504fd6b17a7a3fb2a77f755d7fe6b6c7f83c96f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11232)