[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240256644

## CI report:

* a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11233)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on pull request #6525: [HUDI-4237] should not sync partition parameters when create non-partition table in spark
alexeykudinkin commented on PR #6525: URL: https://github.com/apache/hudi/pull/6525#issuecomment-1240241085

Approved already. @nsivabalan can you please help landing this one?
[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance
alexeykudinkin commented on PR #6046: URL: https://github.com/apache/hudi/pull/6046#issuecomment-1240240348

@boneanxs will do
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1240221361

## CI report:

* fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
* 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956)
* 18fd090f5b6ea14f970a315788372df5acac7939 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11135)
* ccb8f60b4f1280ce8935d5713d03d6d9e0eac8fb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11241)
[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath
hudi-bot commented on PR #6631: URL: https://github.com/apache/hudi/pull/6631#issuecomment-1240221583

## CI report:

* e8e8c4d8047b5985764f7534bd84e82763c3ad28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11243)
[GitHub] [hudi] hudi-bot commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
hudi-bot commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240220669

## CI report:

* 7a9f87cb94043c2447da84ff07ff93009c891174 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11214)
* 07f8c3922c20d3350a21ead05f0104ba57af0092 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11242)
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4810:
---------------------------------
    Labels: pull-request-available  (was: )

> Fix Hudi bundles requiring log4j2 on the classpath
> --------------------------------------------------
>
>                 Key: HUDI-4810
>                 URL: https://issues.apache.org/jira/browse/HUDI-4810
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.12.1
>
> As part of addressing HUDI-4441, we erroneously rebased Hudi onto the "log4j-1.2-api" module under the impression that it was an API module (as advertised). That turned out not to be the case: it is an actual bridge implementation, requiring a Log4j2 implementation to be provided on the classpath as a required dependency.
> For versions of Spark < 3.3, this triggers exceptions like the following (reported by [~akmodi]):
>
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>     at org.apache.hudi.metrics.datadog.DatadogReporter.<init>(DatadogReporter.java:55)
>     at org.apache.hudi.metrics.datadog.DatadogMetricsReporter.<init>(DatadogMetricsReporter.java:62)
>     at org.apache.hudi.metrics.MetricsReporterFactory.createReporter(MetricsReporterFactory.java:70)
>     at org.apache.hudi.metrics.Metrics.<init>(Metrics.java:50)
>     at org.apache.hudi.metrics.Metrics.init(Metrics.java:96)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics.<init>(HoodieDeltaStreamerMetrics.java:44)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:243)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:663)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:143)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:116)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:562)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 23 more
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
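The failure mode above can be reproduced in miniature. The following is a hypothetical sketch (not Hudi code): a classloader that cannot see any third-party jars fails to resolve `org.apache.logging.log4j.LogManager`, which is exactly what happens at runtime when the `log4j-1.2-api` bridge is bundled without a Log4j2 implementation on the classpath.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class Log4jClasspathCheck {
    public static void main(String[] args) throws Exception {
        // An isolated loader with no URLs and a null parent delegates only to the
        // bootstrap loader: java.* classes resolve, third-party jars do not.
        try (URLClassLoader isolated = new URLClassLoader(new URL[0], null)) {
            try {
                Class.forName("org.apache.logging.log4j.LogManager", false, isolated);
                System.out.println("LogManager resolved");
            } catch (ClassNotFoundException e) {
                // Mirrors the "Caused by" clause in the stack trace above.
                System.out.println("ClassNotFoundException: " + e.getMessage());
            }
        }
    }
}
```

In the real bundles the same lookup is triggered indirectly, through logging calls made via the bridge, rather than by an explicit `Class.forName`.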
[GitHub] [hudi] hudi-bot commented on pull request #6631: [HUDI-4810] Fixing Hudi bundles requiring log4j2 on the classpath
hudi-bot commented on PR #6631: URL: https://github.com/apache/hudi/pull/6631#issuecomment-1240218501

## CI report:

* e8e8c4d8047b5985764f7534bd84e82763c3ad28 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1240218305

## CI report:

* fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
* 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956)
* 18fd090f5b6ea14f970a315788372df5acac7939 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11135)
* ccb8f60b4f1280ce8935d5713d03d6d9e0eac8fb UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
hudi-bot commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240217636

## CI report:

* 7a9f87cb94043c2447da84ff07ff93009c891174 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11214)
* 07f8c3922c20d3350a21ead05f0104ba57af0092 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6616: Add Postgres Schema Name to Postgres Debezium Source
hudi-bot commented on PR #6616: URL: https://github.com/apache/hudi/pull/6616#issuecomment-1240215471

## CI report:

* 25a5a5c619d56e686e6fb38e20e841ef9a1e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11231)
[GitHub] [hudi] hudi-bot commented on pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
hudi-bot commented on PR #6628: URL: https://github.com/apache/hudi/pull/6628#issuecomment-1240215510

## CI report:

* 2504fd6b17a7a3fb2a77f755d7fe6b6c7f83c96f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11232)
[GitHub] [hudi] praveenkmr commented on issue #6623: [SUPPORT] java.lang.ClassNotFoundException: Class org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener with HBase Index
praveenkmr commented on issue #6623: URL: https://github.com/apache/hudi/issues/6623#issuecomment-1240213423

@yihua Thanks a lot, Ethan. I tried the suggestion and it worked fine. Still wondering: for future upgrades, do we need to follow the same approach of loading all the jars during spark-submit, or is there scope in the latest version to use hudi-spark-bundle.jar directly?
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965494212

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java:

@@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig {
       .sinceVersion("0.7.0")
       .withDocumentation("");

+  public static final ConfigProperty<Boolean> LOCK_METRICS_ENABLE = ConfigProperty
+      .key(METRIC_PREFIX + ".lock.enable")
+      .defaultValue(false)

Review Comment: Fixed
[GitHub] [hudi] LXin96 commented on a diff in pull request #6614: [DOCS] Asf site update flink option 'read.tasks & write.tasks' description
LXin96 commented on code in PR #6614: URL: https://github.com/apache/hudi/pull/6614#discussion_r965490917

## website/docs/configurations.md:

@@ -978,8 +978,8 @@ Actual value obtained by invoking .toString(), default ''
 ---
 > write.tasks
-> Parallelism of tasks that do actual write, default is 4
-> **Default Value**: 4 (Optional)
+> Parallelism of tasks that do actual write, default is the parallelism of the execution environment
+> **Default Value**: N/A (Optional)

Review Comment: OK, I get that.
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965487305

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java:

@@ -130,6 +140,13 @@ public Timer.Context getIndexCtx() {
     return indexTimer == null ? null : indexTimer.time();
   }

+  public Timer.Context getConflictResolutionCtx() {
+    if (config.isMetricsOn() && conflictResolutionTimer == null) {

Review Comment: Going forward with the infer change, I am only checking for LockMetricsOn.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240196482

Either clustering causes the data duplication, or it is a Presto engine adapter issue.
[jira] [Created] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
Alexey Kudinkin created HUDI-4810:
-------------------------------------
             Summary: Fix Hudi bundles requiring log4j2 on the classpath
                 Key: HUDI-4810
                 URL: https://issues.apache.org/jira/browse/HUDI-4810
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.1
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4810:
----------------------------------
    Status: In Progress  (was: Open)
[jira] [Updated] (HUDI-4810) Fix Hudi bundles requiring log4j2 on the classpath
[ https://issues.apache.org/jira/browse/HUDI-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4810:
----------------------------------
    Sprint: 2022/09/05
[GitHub] [hudi] alexeykudinkin opened a new pull request, #6631: [WIP] Fixing Hudi bundles requiring log4j2 on the classpath
alexeykudinkin opened a new pull request, #6631: URL: https://github.com/apache/hudi/pull/6631

### Change Logs

In XXX, we rebased Hudi to instead rely mostly on the Log4j2 bridge and implementations (in tests). However, we missed the fact that `log4j-1.2-api` is not actually an API module (as advertised) but rather a fully-fledged implementation, bringing in the requirement to provide a Log4j2 implementation jar on the classpath.

### Impact

Risk level: Medium

TBD: Manual bundles compatibility verification.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965477406

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java:

@@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig {
       .sinceVersion("0.7.0")
       .withDocumentation("");

+  public static final ConfigProperty<Boolean> LOCK_METRICS_ENABLE = ConfigProperty
+      .key(METRIC_PREFIX + ".lock.enable")
+      .defaultValue(false)

Review Comment: OK, so we want the default to correspond to metrics.on, and false only when set explicitly. Is my understanding correct?
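The inference being discussed can be sketched as follows. This is a hypothetical helper, not Hudi's actual `ConfigProperty` API; the key names follow the PR, but the lookup logic is an assumption about the intended behavior: an explicit value for the lock-metrics key wins, otherwise the value is inherited from the global `hoodie.metrics.on` switch.

```java
import java.util.Map;

public class LockMetricsDefault {
    // Hypothetical sketch of "infer default from another key".
    static boolean lockMetricsEnabled(Map<String, String> conf) {
        String explicit = conf.get("hoodie.metrics.lock.enable");
        if (explicit != null) {
            // An explicit setting always wins, even when metrics are on globally.
            return Boolean.parseBoolean(explicit);
        }
        // Otherwise inherit from the global metrics switch.
        return Boolean.parseBoolean(conf.getOrDefault("hoodie.metrics.on", "false"));
    }

    public static void main(String[] args) {
        // Inherited from hoodie.metrics.on -> true
        System.out.println(lockMetricsEnabled(Map.of("hoodie.metrics.on", "true")));
        // Explicit false overrides the global switch -> false
        System.out.println(lockMetricsEnabled(Map.of(
                "hoodie.metrics.on", "true",
                "hoodie.metrics.lock.enable", "false")));
    }
}
```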
[GitHub] [hudi] hudi-bot commented on pull request #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer
hudi-bot commented on PR #5269: URL: https://github.com/apache/hudi/pull/5269#issuecomment-1240185610

## CI report:

* 6f8d22ccc5efbd87ff993a46ea1977355842602f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7944)
* a360d286f9a9bff3f60cc7231bc0abfe86675a88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11240)
[GitHub] [hudi] jsbali commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
jsbali commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965475615

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java:

@@ -64,13 +69,18 @@ public void lock() {
     boolean acquired = false;
     while (retryCount <= maxRetries) {
       try {
+        metrics.startLockApiTimerContext();
         acquired = lockProvider.tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS);
         if (acquired) {
+          metrics.updateLockAcquiredMetric();
+          metrics.startLockHeldTimerContext();

Review Comment: `updateLockAcquiredMetric` has a parallel function, `updateLockNotAcquiredMetric`, which is the reason I have kept them separate.
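The retry loop in the diff above can be sketched in isolation. This is a hedged illustration, not Hudi's `LockManager` or `HoodieLockMetrics` API: a plain `ReentrantLock` stands in for the lock provider, and the metric calls from the diff are marked as comments where they would go.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    // Hypothetical sketch of a timed lock-acquisition retry loop.
    public static boolean lockWithRetries(ReentrantLock lock, int maxRetries,
                                          long waitMs) throws InterruptedException {
        for (int retry = 0; retry <= maxRetries; retry++) {
            long start = System.nanoTime();                        // startLockApiTimerContext()
            boolean acquired = lock.tryLock(waitMs, TimeUnit.MILLISECONDS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (acquired) {
                // updateLockAcquiredMetric() + startLockHeldTimerContext() would go here.
                System.out.println("acquired after " + elapsedMs + " ms on retry " + retry);
                return true;
            }
            // updateLockNotAcquiredMetric() would be recorded on this failure path.
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        System.out.println(lockWithRetries(lock, 2, 10));
    }
}
```

Keeping the acquired/not-acquired updates as separate methods, as the comment argues, lets each path carry its own counter and timer without branching inside a single metric call.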
[GitHub] [hudi] hudi-bot commented on pull request #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer
hudi-bot commented on PR #5269: URL: https://github.com/apache/hudi/pull/5269#issuecomment-1240183216

## CI report:

* 6f8d22ccc5efbd87ff993a46ea1977355842602f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7944)
* a360d286f9a9bff3f60cc7231bc0abfe86675a88 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
hudi-bot commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240181354

## CI report:

* 85a8f5166c17ec5ce9fa00e2c38846f440582acf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11239)
[GitHub] [hudi] hudi-bot commented on pull request #6629: [HUDI-4807] Use base table instant for metadata table initialization
hudi-bot commented on PR #6629: URL: https://github.com/apache/hudi/pull/6629#issuecomment-1240181342 ## CI report: * c88a869d5d8e748edac75698c7c504176a06e47d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11238)
[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573
hudi-bot commented on PR #6574: URL: https://github.com/apache/hudi/pull/6574#issuecomment-1240181238 ## CI report: * 7ced8cc1e89594e2a074a546a165ce3ef744841f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11211) * b158b5a580ffb609380dcac27a299c9a7557d649 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11237)
[GitHub] [hudi] hudi-bot commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
hudi-bot commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240178449 ## CI report: * 85a8f5166c17ec5ce9fa00e2c38846f440582acf UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6574: Keep a clustering running at the same time.#6573
hudi-bot commented on PR #6574: URL: https://github.com/apache/hudi/pull/6574#issuecomment-1240178259 ## CI report: * 7ced8cc1e89594e2a074a546a165ce3ef744841f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11211) * b158b5a580ffb609380dcac27a299c9a7557d649 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6629: [HUDI-4807] Use base table instant for metadata table initialization
hudi-bot commented on PR #6629: URL: https://github.com/apache/hudi/pull/6629#issuecomment-1240178412 ## CI report: * c88a869d5d8e748edac75698c7c504176a06e47d UNKNOWN
[GitHub] [hudi] Gatsby-Lee closed issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure
Gatsby-Lee closed issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure URL: https://github.com/apache/hudi/issues/6024
[GitHub] [hudi] Gatsby-Lee commented on issue #6024: [SUPPORT] DELETE_PARTITION causes AWS Athena Query failure
Gatsby-Lee commented on issue #6024: URL: https://github.com/apache/hudi/issues/6024#issuecomment-1240177021 Hi, let's close this issue since I seem to be the only one facing it. Let me write down more details before I forget. A couple of months ago, I tried the DELETE_PARTITION operation with 0.10.1 and 0.11.0, and noticed that the two versions behave differently when Hudi runs DELETE_PARTITION on a non-existent partition:

* 0.10.1 raised an exception and failed (the serious issue was that Hudi became unstable).
* 0.11.0 was silent (VC told me that this is not the right behavior either; it should raise an exception).

I wasn't able to use 0.11.0 because it has a compatibility issue in AWS Glue (related to the AWS Glue Catalog). I wasn't able to use 0.10.1 because it has a bug in ZookeeperLockProvider. I ended up using 0.10.1 plus a patch that fixed the ZookeeperLockProvider (available in 0.11.1), and I added logic that checks whether the target partition exists. ( cc @codope ) I will test with 0.11.1 and reopen this ticket if I still notice a similar issue. Thank you, Gatsby
[GitHub] [hudi] yihua commented on issue #6590: [SUPPORT] HoodieDeltaStreamer AWSDmsAvroPayload fails to handle deletes in MySQL
yihua commented on issue #6590: URL: https://github.com/apache/hudi/issues/6590#issuecomment-1240175913 This is the same issue as #6552. cc @rahil-c
[GitHub] [hudi] yihua commented on issue #6552: [SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0
yihua commented on issue #6552: URL: https://github.com/apache/hudi/issues/6552#issuecomment-1240175667 @rahil-c and I discussed this today. The proper fix is to call the corresponding API instead of repeating the invocation of `handleDeleteOperation`:

```java
@Override
public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
  return getInsertValue(schema);
}

@Override
public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
  IndexedRecord insertValue = super.getInsertValue(schema).get();
  return handleDeleteOperation(insertValue);
}

@Override
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties) throws IOException {
  return combineAndGetUpdateValue(currentValue, schema);
}

@Override
public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
  IndexedRecord insertValue = super.getInsertValue(schema).get();
  return handleDeleteOperation(insertValue);
}
```

@rahil-c will put up a fix.
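The essence of the fix above is an overload-delegation pattern: the `Properties`-taking overload forwards to the simpler overload, so the delete-handling logic lives in exactly one place. A minimal, self-contained sketch of that pattern (the `PayloadSketch` class and its fields are hypothetical stand-ins, not the real `AWSDmsAvroPayload`):

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical payload: the Properties overload delegates instead of
// re-implementing the delete check, so there is a single source of truth.
public class PayloadSketch {
  private final String value;
  private final boolean isDelete;  // stand-in for a DMS-style delete marker

  public PayloadSketch(String value, boolean isDelete) {
    this.value = value;
    this.isDelete = isDelete;
  }

  // The only place where delete handling happens.
  public Optional<String> getInsertValue() {
    return isDelete ? Optional.empty() : Optional.of(value);
  }

  // Overload with extra context: just delegate, never duplicate the logic.
  public Optional<String> getInsertValue(Properties props) {
    return getInsertValue();
  }
}
```

If the delete check were copied into both overloads, a later change to one of them could silently diverge from the other, which is exactly the class of bug the comment describes.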
[GitHub] [hudi] xiarixiaoyao commented on pull request #6322: [HUDI-4559] Support hiveSync command based on Call Produce Command
xiarixiaoyao commented on PR #6322: URL: https://github.com/apache/hudi/pull/6322#issuecomment-1240173783 @XuQianJin-Stars pls resolve the conflicts, thanks
[GitHub] [hudi] wangp-nhlab commented on pull request #6544: When Hudi choose Append save mode in Spark , the basepath may be error codes
wangp-nhlab commented on PR #6544: URL: https://github.com/apache/hudi/pull/6544#issuecomment-1240166688 > @wangp-nhlab Could you follow the process [here](https://hudi.apache.org/contribute/developer-setup#filing-jiras) to create a JIRA ticket and attach the ticket number to the PR? Okay
[GitHub] [hudi] TJX2014 commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240166454 Hi, @danny0405 , this is another patch for https://github.com/apache/hudi/pull/6595
[GitHub] [hudi] wangp-nhlab commented on a diff in pull request #6544: When Hudi choose Append save mode in Spark , the basepath may be error codes
wangp-nhlab commented on code in PR #6544: URL: https://github.com/apache/hudi/pull/6544#discussion_r965463084 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java: ##

```diff
@@ -176,7 +178,8 @@ private T executeRequest(String requestPath, Map queryParame
         response = Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute();
         break;
     }
-    String content = response.returnContent().asString();
+    Charset charset = Consts.UTF_8;
+    String content = response.returnContent().asString(charset);
```

Review Comment: Yes, it is found that only the append mode in Spark has this problem.
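The one-line change in the diff matters because `asString()` with no argument decodes the response body with a default charset, which can garble non-ASCII base paths. A self-contained demonstration of the failure mode using only the JDK (no Hudi or HttpClient types; this sketch only models the decoding step, not the HTTP call):

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
  // Decode UTF-8 encoded bytes with the correct, explicit charset.
  public static String decodeUtf8(byte[] body) {
    return new String(body, StandardCharsets.UTF_8);
  }

  // Decode the same bytes with a wrong charset, simulating an unlucky
  // platform default; multi-byte characters come out garbled.
  public static String decodeLatin1(byte[] body) {
    return new String(body, StandardCharsets.ISO_8859_1);
  }
}
```

Passing the charset explicitly, as the patch does with `asString(charset)`, removes the dependency on the JVM's platform default entirely.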
[GitHub] [hudi] TJX2014 commented on pull request #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 commented on PR #6630: URL: https://github.com/apache/hudi/pull/6630#issuecomment-1240165726 @minihippo Hi, please help review this; I think this patch can fix HoodieSimpleBucketIndex first.
[jira] [Updated] (HUDI-4808) HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table
[ https://issues.apache.org/jira/browse/HUDI-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4808: - Labels: pull-request-available (was: ) > HoodieSimpleBucketIndex should also consider bucket num in log file not in > base file which written by flink mor table > - > > Key: HUDI-4808 > URL: https://issues.apache.org/jira/browse/HUDI-4808 > Project: Apache Hudi > Issue Type: Bug >Reporter: JinxinTang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #6016: [HUDI-4465] Optimizing file-listing sequence of Metadata Table
codope commented on code in PR #6016: URL: https://github.com/apache/hudi/pull/6016#discussion_r965462152 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java: ##

```diff
@@ -46,6 +47,12 @@ public SimpleKeyGenerator(TypedProperties props) {
   SimpleKeyGenerator(TypedProperties props, String recordKeyField, String partitionPathField) {
     super(props);
+    // Make sure key-generator is configured properly
+    ValidationUtils.checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
+        "Record key field has to be non-empty!");
+    ValidationUtils.checkArgument(partitionPathField == null || !partitionPathField.isEmpty(),
```

Review Comment: Fair enough. Should we add these validations to other keygens as well?
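The validations in the diff use the pattern `field == null || !field.isEmpty()`: a `null` field (not configured at all) is allowed, while an explicitly empty string is rejected as a misconfiguration. A small stand-alone sketch of that contract (a hypothetical helper, not the actual `ValidationUtils`):

```java
public class KeyFieldValidation {
  // Accept null (unset) but reject the empty string, which almost
  // certainly indicates a misconfigured key generator.
  public static void checkNullOrNonEmpty(String field, String message) {
    if (field != null && field.isEmpty()) {
      throw new IllegalArgumentException(message);
    }
  }
}
```

Failing fast in the constructor, as the patch does, surfaces the misconfiguration at table-setup time instead of as a confusing downstream key-generation error.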
[GitHub] [hudi] TJX2014 opened a new pull request, #6630: [HUDI-4808] Fix HoodieSimpleBucketIndex not consider bucket num in lo…
TJX2014 opened a new pull request, #6630: URL: https://github.com/apache/hudi/pull/6630

### Change Logs

Make HoodieSimpleBucketIndex also load the bucket index from log files.

### Impact

Spark will read the bucket index correctly when the log file was written by Flink to a MOR table.

**Risk level:** none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
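For context, a simple bucket index deterministically maps each record key to one of N buckets; the bug is about which file (base file vs log file) the existing bucket membership is discovered from when Flink has written a log-only MOR file group. A minimal hashing sketch of the key-to-bucket mapping (hypothetical, not Hudi's actual `BucketIdentifier`):

```java
public class BucketIndexSketch {
  // Deterministically map a record key to a bucket in [0, numBuckets).
  // Masking with Integer.MAX_VALUE avoids a negative bucket id when
  // hashCode() is negative.
  public static int bucketId(String recordKey, int numBuckets) {
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }
}
```

Because the mapping is deterministic, every writer (Spark or Flink) must agree on it and on where bucket membership is recorded; otherwise one engine routes updates to a bucket the other engine cannot find.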
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965462117 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ##

```diff
@@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception {
     }
   }
 }
+
+  /**
+   * Determine whether to throw an exception when local view of table's timeline is behind that of client's view.
+   */
+  private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) {
```

Review Comment: For example, when there is only one more .clean.completed, it should also be synchronized
[jira] [Created] (HUDI-4809) Hudi Support AWS Glue DropPartitions
XixiHua created HUDI-4809: - Summary: Hudi Support AWS Glue DropPartitions Key: HUDI-4809 URL: https://issues.apache.org/jira/browse/HUDI-4809 Project: Apache Hudi Issue Type: New Feature Components: metadata Reporter: XixiHua
[jira] [Created] (HUDI-4808) HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table
JinxinTang created HUDI-4808: Summary: HoodieSimpleBucketIndex should also consider bucket num in log file not in base file which written by flink mor table Key: HUDI-4808 URL: https://issues.apache.org/jira/browse/HUDI-4808 Project: Apache Hudi Issue Type: Bug Reporter: JinxinTang
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965460402 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ##

```diff
@@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception {
     }
   }
 }
+
+  /**
+   * Determine whether to throw an exception when local view of table's timeline is behind that of client's view.
+   */
+  private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) {
```

Review Comment: My idea is to judge whether to throw an exception if `isLocalViewBehind` returns true. Because the `isLocalViewBehind` method is also used in the `syncIfLocalViewBehind` method, I am not sure whether it is appropriate to directly modify the logic of the `isLocalViewBehind` method
[GitHub] [hudi] TJX2014 commented on pull request #6595: [HUDI-4777] Fix flink gen bucket index of mor table not consistent wi…
TJX2014 commented on PR #6595: URL: https://github.com/apache/hudi/pull/6595#issuecomment-1240160173 > I will provide a PR fix on the Spark side too, but on the Flink side I think deduplication should also be enabled as the default option for MOR tables. When duplicates are written to the log file, it is very hard for compaction to read them, and it also makes the MOR table unstable because the duplicate records are read into memory twice.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240156176

> Remove these config, then data duplication disappeared. why?
>
> ```
> // option("hoodie.clustering.inline", "true").
> // option("hoodie.clustering.inline.max.commits", "4").
> // option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
> // option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
> // option("hoodie.clustering.plan.strategy.sort.columns", "userId,schoolId,timeStamp").
> ```
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240156022 > still repeated according to the new patch
[GitHub] [hudi] arunb2w commented on issue #6626: [SUPPORT] HUDI merge into via spark sql not working
arunb2w commented on issue #6626: URL: https://github.com/apache/hudi/issues/6626#issuecomment-1240154803 @nsivabalan Can you please provide some help on this issue
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240154151 After removing these configs, the data duplication disappeared. Why?

```
// option("hoodie.clustering.inline", "true").
// option("hoodie.clustering.inline.max.commits", "4").
// option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
// option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
// option("hoodie.clustering.plan.strategy.sort.columns", "userId,schoolId,timeStamp").
```
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240151995 still repeated according to the new patch
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4722: -- Status: Patch Available (was: In Progress) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jagmeet bali >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Added metrics for following > # Lock request latency > # Count of Lock success > # Count of failed to acquire the lock > # Duration of locks held with support for re-entrancy > # Conflict resolution metrics. Success vs Failure
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4722: -- Status: In Progress (was: Open) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jagmeet bali >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Added metrics for following > # Lock request latency > # Count of Lock success > # Count of failed to acquire the lock > # Duration of locks held with support for re-entrancy > # Conflict resolution metrics. Success vs Failure
[jira] [Updated] (HUDI-4807) Use correct instant in metadata initialization
[ https://issues.apache.org/jira/browse/HUDI-4807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4807: - Labels: pull-request-available (was: ) > Use correct instant in metadata initialization > -- > > Key: HUDI-4807 > URL: https://issues.apache.org/jira/browse/HUDI-4807 > Project: Apache Hudi > Issue Type: Bug >Reporter: Yuwei Xiao >Priority: Major > Labels: pull-request-available >
[GitHub] [hudi] YuweiXiao opened a new pull request, #6629: [HUDI-4807] Use base table instant for metadata table initialization
YuweiXiao opened a new pull request, #6629: URL: https://github.com/apache/hudi/pull/6629

### Change Logs

Use base table instant for metadata table initialization

### Impact

No public API change.

**Risk level:** none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Closed] (HUDI-4615) Fix empty commits being made by deltastreamer with S3EventsSource when there is no data in SQS on starting a new pipeline
[ https://issues.apache.org/jira/browse/HUDI-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4615. - Resolution: Fixed > Fix empty commits being made by deltastreamer with S3EventsSource when there > is no data in SQS on starting a new pipeline > - > > Key: HUDI-4615 > URL: https://issues.apache.org/jira/browse/HUDI-4615 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: sivabalan narayanan >Assignee: Vinish Reddy >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > When we start a new deltastreamer with S3EventsSource, checkpoint is > Option.empty(). After consumption from source, if there is no data, the > source returns "val=0" as the checkpoint. So, deltastreamer assumes > checkpoint has changed and makes an empty commit. This needs fixing. > > [https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/S3EventsMetaSelector.java#L151] >
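The bug described in this ticket is a checkpoint-comparison subtlety: a brand-new pipeline starts with an empty checkpoint, and a source that returns a synthetic "0" checkpoint on an empty poll makes the comparison look like progress. A hypothetical sketch of the faulty vs fixed commit decision (stand-in names, not the real DeltaStreamer code):

```java
import java.util.Optional;

public class CheckpointSketch {
  // Faulty rule: commit whenever the checkpoint string changed.
  // Optional.empty() vs "0" looks like a change, so an empty batch commits.
  public static boolean shouldCommitFaulty(Optional<String> previous, String latest, long numRecords) {
    return !previous.map(latest::equals).orElse(false);
  }

  // Fixed rule: additionally require that the batch actually contained data.
  public static boolean shouldCommitFixed(Optional<String> previous, String latest, long numRecords) {
    return numRecords > 0 && !previous.map(latest::equals).orElse(false);
  }
}
```

This is only one way to close the gap; the actual Hudi fix may gate the commit differently, but the failing comparison it addresses is the one shown here.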
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve
nsivabalan commented on code in PR #5030: URL: https://github.com/apache/hudi/pull/5030#discussion_r965448515 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ##

```diff
@@ -123,25 +133,24 @@ public long getNumMergedRecordsInLog() {
     return numMergedRecordsInLog;
   }
-  /**
-   * Returns the builder for {@code HoodieMergedLogRecordScanner}.
-   */
-  public static HoodieMergedLogRecordScanner.Builder newBuilder() {
-    return new Builder();
-  }
   @Override
   protected void processNextRecord(HoodieRecord hoodieRecord) throws IOException {
     String key = hoodieRecord.getRecordKey();
     if (records.containsKey(key)) {
       // Merge and store the merged record. The HoodieRecordPayload implementation is free to decide what should be
       // done when a delete (empty payload) is encountered before or after an insert/update.
-      HoodieRecord oldRecord = records.get(key);
-      HoodieRecordPayload oldValue = oldRecord.getData();
-      HoodieRecordPayload combinedValue = hoodieRecord.getData().preCombine(oldValue);
-      // If combinedValue is oldValue, no need rePut oldRecord
-      if (combinedValue != oldValue) {
+      HoodieRecord storeRecord = records.get(key);
+      HoodieRecordPayload storeValue = storeRecord.getData();
+      HoodieRecordPayload combinedValue;
+      // If revertLogFile = false, storeRecord is the old record.
+      // If revertLogFile = true, incoming data (hoodieRecord) is the old record.
+      if (!revertLogFile) {
```

Review Comment: Oh, I see we have put in a fix here. Sounds good. But does the below one hold good?

```
delta commit1: insert rec1: val1. preCombine: 2
delta commit2: delete rec1:
delta commit2: insert rec1: val2. preCombine: 1
```

As per master, I guess the final snapshot will return val2 for rec1, and not the deleted one. Can you tell me what will happen with this patch, wherein we reverse the ordering?
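The question above boils down to merge-order semantics: if log files are scanned newest-first, the merge rule has to flip from "incoming record wins" to "first-seen record wins", otherwise the oldest record survives. The toy model below is a hypothetical simplification (not the real `HoodieMergedLogRecordScanner`, and deliberately ignoring preCombine ordering values); it only illustrates that the scan direction and the merge rule must change together:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Toy model: entries for one record key, listed oldest-first; the newest
// entry (possibly a delete) should determine the final snapshot value.
public class MergeOrderSketch {

  public static final class Entry {
    final String value;
    final boolean isDelete;
    public Entry(String value, boolean isDelete) { this.value = value; this.isDelete = isDelete; }
  }

  // Oldest-to-newest scan: each incoming entry overwrites the current one.
  public static Optional<String> scanForward(List<Entry> log) {
    Entry current = null;
    for (Entry e : log) {
      current = e; // incoming wins
    }
    return (current == null || current.isDelete) ? Optional.empty() : Optional.of(current.value);
  }

  // Newest-to-oldest scan: the FIRST entry seen must win instead.
  public static Optional<String> scanReversed(List<Entry> log) {
    List<Entry> rev = new ArrayList<>(log);
    Collections.reverse(rev);
    Entry current = null;
    for (Entry e : rev) {
      if (current == null) {
        current = e; // first seen wins; later (older) entries are ignored
      }
    }
    return (current == null || current.isDelete) ? Optional.empty() : Optional.of(current.value);
  }
}
```

With the thread's example (insert val1, delete, insert val2), both directions agree on val2 only because the merge rule was flipped along with the order; reversing the scan while keeping "incoming wins" would incorrectly resurrect val1.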
[GitHub] [hudi] hudi-bot commented on pull request #5091: [HUDI-3453] Fix HoodieBackedTableMetadata concurrent reading issue
hudi-bot commented on PR #5091: URL: https://github.com/apache/hudi/pull/5091#issuecomment-1240147859 ## CI report: * c0dc922eec0ffe4c93f250dcf91dd313713057db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11221) * c711e86c12cc97e9bb28afefe1de0334a07d840a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11236)
[jira] [Created] (HUDI-4807) Use correct instant in metadata initialization
Yuwei Xiao created HUDI-4807: Summary: Use correct instant in metadata initialization Key: HUDI-4807 URL: https://issues.apache.org/jira/browse/HUDI-4807 Project: Apache Hudi Issue Type: Bug Reporter: Yuwei Xiao
[GitHub] [hudi] hudi-bot commented on pull request #5091: [HUDI-3453] Fix HoodieBackedTableMetadata concurrent reading issue
hudi-bot commented on PR #5091: URL: https://github.com/apache/hudi/pull/5091#issuecomment-1240145111 ## CI report: * c0dc922eec0ffe4c93f250dcf91dd313713057db Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11221) * c711e86c12cc97e9bb28afefe1de0334a07d840a UNKNOWN
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5030: [HUDI-3617] MOR compact improve
nsivabalan commented on code in PR #5030: URL: https://github.com/apache/hudi/pull/5030#discussion_r965446760 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java: ## @@ -280,8 +281,11 @@ HoodieCompactionPlan generateCompactionPlan( .getLatestFileSlices(partitionPath) .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId())) .map(s -> { + // In most business scenarios, the latest data is in the latest delta log file, so we sort it from large + // to small according to the instant time, which can largely avoid rewriting the data in the + // compaction process, and then optimize the compaction time List logFiles = - s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator()).collect(toList()); + s.getLogFiles().sorted(HoodieLogFile.getLogFileComparator().reversed()).collect(toList()); Review Comment: We might have to consider a few cases before we can flip the ordering here. Case 1: when OverwriteWithLatestAvro is used, if preCombine matches, we pick the latest. For example: delta commit 1: insert rec1: val1, preCombine: 1; delta commit 2: update rec1: val2, preCombine: 1; delta commit 3: insert rec1: val3, preCombine: 1. If we merge as usual (master), the final value of rec1 should be val3, but if we reverse, it could result in val1. Case 2: some payload implementations take values from the old record and combine with newer ones; in other words, they may not be commutative. For example, rec1.combineAndGetUpdate(rec2) != rec2.combineAndGetUpdate(rec1), or preCombine() for that matter. I really like the intent behind this patch, but I'm not sure it's as easy as flipping the order of log file merging.
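The order-dependence described in that review can be reproduced with a small self-contained sketch. Note this is a hypothetical stand-in: the `Rec` type and the latest-wins merge function below only mimic the preCombine/`OverwriteWithLatestAvroPayload` behavior; they are not Hudi's actual classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;

public class MergeOrderDemo {
    // Hypothetical record: key, value, and a preCombine ordering field.
    static final class Rec {
        final String key, val;
        final long orderingVal;
        Rec(String key, String val, long orderingVal) {
            this.key = key; this.val = val; this.orderingVal = orderingVal;
        }
    }

    // Mimics latest-wins semantics: on equal preCombine values, the record
    // seen LATER in the merge order wins -- which makes the result depend
    // on the order the log files are replayed in.
    static final BinaryOperator<Rec> OVERWRITE_WITH_LATEST =
        (older, newer) -> newer.orderingVal >= older.orderingVal ? newer : older;

    static Rec fold(List<Rec> logOrder) {
        return logOrder.stream().reduce(OVERWRITE_WITH_LATEST).orElseThrow();
    }

    public static void main(String[] args) {
        List<Rec> commits = List.of(
            new Rec("rec1", "val1", 1),   // delta commit 1
            new Rec("rec1", "val2", 1),   // delta commit 2
            new Rec("rec1", "val3", 1));  // delta commit 3
        List<Rec> reversed = new ArrayList<>(commits);
        Collections.reverse(reversed);
        System.out.println("forward=" + fold(commits).val
            + ", reversed=" + fold(reversed).val);
    }
}
```

Under commit-time order the last writer wins (val3); under the reversed order the oldest record resurfaces (val1), which is exactly the correctness concern with flipping the comparator.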
[GitHub] [hudi] dongkelun commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965446536 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ## @@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception { } } } + + /** + * Determine whether to throw an exception when local view of table's timeline is behind that of client's view. + */ + private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) { Review Comment: Sorry, if it is called in `isLocalViewBehind` there is a timeline, but in the `handle` method I don't see where there is a timeline, and the original code instantiates the timeline in `errMsg`.
[GitHub] [hudi] hudi-bot commented on pull request #6615: [HUDI-4758] Add validations to java spark examples
hudi-bot commented on PR #6615: URL: https://github.com/apache/hudi/pull/6615#issuecomment-1240143195 ## CI report: * 3b37307093cf2c6eb20a4e5f738f8bac38f1dba7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11230) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] dongkelun commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
dongkelun commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240139114 > Also, a good practice to follow: whenever you are addressing feedback, try to add it as new commits. It's easier for the reviewer to re-review just the new changes; otherwise I have to review the entire patch again. If you add new commits, I can click on the newer commits and review only the newly changed code. OK, sorry, I thought it was fine to click "compare" after a force push; I mistakenly thought that force pushing would look cleaner. I didn't know you reviewed by comparing two commits. I'll pay attention in the future.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6502: HUDI-4722 Added locking metrics for Hudi
nsivabalan commented on code in PR #6502: URL: https://github.com/apache/hudi/pull/6502#discussion_r965438383 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java: ## @@ -130,6 +140,13 @@ public Timer.Context getIndexCtx() { return indexTimer == null ? null : indexTimer.time(); } + public Timer.Context getConflictResolutionCtx() { +if (config.isMetricsOn() && conflictResolutionTimer == null) { Review Comment: Shouldn't we check for LockMetricsOn() as well? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java: ## @@ -64,13 +69,18 @@ public void lock() { boolean acquired = false; while (retryCount <= maxRetries) { try { + metrics.startLockApiTimerContext(); acquired = lockProvider.tryLock(writeConfig.getLockAcquireWaitTimeoutInMs(), TimeUnit.MILLISECONDS); if (acquired) { +metrics.updateLockAcquiredMetric(); +metrics.startLockHeldTimerContext(); Review Comment: Can we combine both of these into a single method? We don't have any caller that calls either of these individually, right? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/metrics/HoodieMetricsConfig.java: ## @@ -83,6 +83,11 @@ public class HoodieMetricsConfig extends HoodieConfig { .sinceVersion("0.7.0") .withDocumentation(""); + public static final ConfigProperty LOCK_METRICS_ENABLE = ConfigProperty + .key(METRIC_PREFIX + ".lock.enable") + .defaultValue(false) Review Comment: Actually, we can add an infer function: if not explicitly set by the user, we can fetch the value of hoodie.metrics.enable, and we may not need to set a default value here.
Example of an infer function in HoodieMetricsConfig:
```
public static final ConfigProperty METRICS_REPORTER_PREFIX = ConfigProperty
    .key(METRIC_PREFIX + ".reporter.metricsname.prefix")
    .defaultValue("")
    .sinceVersion("0.11.0")
    .withInferFunction(cfg -> {
      if (cfg.contains(HoodieTableConfig.NAME)) {
        return Option.of(cfg.getString(HoodieTableConfig.NAME));
      }
      return Option.empty();
    })
    .withDocumentation("The prefix given to the metrics names.");
```
## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java: ## @@ -460,8 +461,19 @@ protected void preCommit(HoodieInstant inflightInstant, HoodieCommitMetadata met // Create a Hoodie table after startTxn which encapsulated the commits and files visible. // Important to create this after the lock to ensure the latest commits show up in the timeline without need for reload HoodieTable table = createTable(config, hadoopConf); -TransactionUtils.resolveWriteConflictIfAny(table, this.txnManager.getCurrentTransactionOwner(), -Option.of(metadata), config, txnManager.getLastCompletedTransactionOwner(), false, this.pendingInflightAndRequestedInstants); +Timer.Context indexTimer = metrics.getConflictResolutionCtx(); Review Comment: Minor: `indexTimer` -> `conflictResolutionTimer`
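The "combine both calls into a single method" suggestion can be sketched as below. This is a hypothetical, simplified stand-in for the PR's lock metrics class (plain counters and nanosecond timestamps instead of the real Dropwizard timers):

```java
public class LockMetricsSketch {
    private long lockAcquiredCount = 0;
    private long lockHeldStartNanos = -1;

    // Single entry point replacing the back-to-back calls
    // updateLockAcquiredMetric() + startLockHeldTimerContext(),
    // so neither can be invoked without the other.
    public void onLockAcquired() {
        lockAcquiredCount++;                    // count successful acquisitions
        lockHeldStartNanos = System.nanoTime(); // start the lock-held timer
    }

    // Returns how long the lock was held, in nanoseconds.
    public long onLockReleased() {
        return System.nanoTime() - lockHeldStartNanos;
    }

    public long acquiredCount() {
        return lockAcquiredCount;
    }
}
```

Collapsing the two calls into one method makes the invariant "every acquisition starts a held-timer" impossible to violate at call sites, which is the point of the review comment.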
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
nsivabalan commented on code in PR #5478: URL: https://github.com/apache/hudi/pull/5478#discussion_r965436152 ## hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/RequestHandler.java: ## @@ -539,4 +543,19 @@ public void handle(@NotNull Context context) throws Exception { } } } + + /** + * Determine whether to throw an exception when local view of table's timeline is behind that of client's view. + */ + private boolean shouldThrowExceptionIfLocalViewBehind(HoodieTimeline localTimeline, String lastInstantTs) { Review Comment: Shouldn't we call this from within isLocalViewBehind()? We already have the timeline there, right? We don't need to re-instantiate the timeline again in L507.
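The refactor being suggested — passing the already-built timeline into the check instead of rebuilding it at the call site — might look roughly like this. All names here are simplified, hypothetical stand-ins for the `RequestHandler` methods under discussion, with a `List<String>` of instant timestamps standing in for `HoodieTimeline`:

```java
import java.util.List;

public class TimelineCheckSketch {

    static boolean isLocalViewBehind(List<String> localTimeline, String clientLastInstantTs) {
        boolean behind = localTimeline.isEmpty()
            || localTimeline.get(localTimeline.size() - 1).compareTo(clientLastInstantTs) < 0;
        // Reuse the SAME timeline object for the follow-up decision instead of
        // re-instantiating it at the call site (the duplication being flagged).
        if (behind && shouldThrowExceptionIfLocalViewBehind(localTimeline, clientLastInstantTs)) {
            throw new IllegalStateException(
                "Local view of timeline is behind client view at " + clientLastInstantTs);
        }
        return behind;
    }

    static boolean shouldThrowExceptionIfLocalViewBehind(List<String> timeline, String lastInstantTs) {
        // Example policy only: throw if the client's instant is entirely unknown locally.
        return !timeline.contains(lastInstantTs);
    }
}
```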
[GitHub] [hudi] nsivabalan commented on pull request #6536: [HUDI-4736] Fix inflight clean action preventing clean service to continue when multiple cleans are not allowed
nsivabalan commented on PR #6536: URL: https://github.com/apache/hudi/pull/6536#issuecomment-1240127060 @yihua : can you check the CI failure? Please file a tracking JIRA for enhancing tests. Once CI succeeds, you can go ahead and land it.
[GitHub] [hudi] nsivabalan commented on pull request #5478: [HUDI-3998] Fix getCommitsSinceLastCleaning failed when async cleaning
nsivabalan commented on PR #5478: URL: https://github.com/apache/hudi/pull/5478#issuecomment-1240126607 Also, a good practice to follow: whenever you are addressing feedback, try to add it as new commits. It's easier for the reviewer to re-review just the new changes; otherwise I have to review the entire patch again. If you add new commits, I can click on the newer commits and review only the newly changed code.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6031: [HUDI-4282] Repair IOException in some other dfs, except hdfs,when check block corrupted in HoodieLogFileReader
nsivabalan commented on code in PR #6031: URL: https://github.com/apache/hudi/pull/6031#discussion_r965431934 ## hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java: ## @@ -632,6 +635,15 @@ public static boolean isGCSFileSystem(FileSystem fs) { return fs.getScheme().equals(StorageSchemes.GCS.getScheme()); } + /** + * Some filesystem(such as chdfs) will throw {@code IOException} instead of {@code EOFException}. It will cause error in isBlockCorrupted(). + * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired offset is out of the file size in advance. + */ + public static boolean shouldWrappedByBoundedDataStream(FileSystem fs) { Review Comment: Can we keep it simple for now?
```
public static boolean isCHDFSFileSystem(FileSystem fs) {
  return fs.getScheme().equals(StorageSchemes.CHDFS.getScheme());
}
```
If at all we come across other storage schemes which might need this, we can make it a map.
[GitHub] [hudi] Gump518 commented on issue #6609: hudi upsert occured data duplication by spark streaming (cow table)
Gump518 commented on issue #6609: URL: https://github.com/apache/hudi/issues/6609#issuecomment-1240123350 Thanks, today we'll test according to the new patch. If there's any news, we'll sync it with you again.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6031: [HUDI-4282] Repair IOException in some other dfs, except hdfs,when check block corrupted in HoodieLogFileReader
nsivabalan commented on code in PR #6031: URL: https://github.com/apache/hudi/pull/6031#discussion_r965431369 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java: ## @@ -516,4 +521,23 @@ private static FSDataInputStream getFSDataInputStreamForGCS(FSDataInputStream fs return fsDataInputStream; } + + /** + * Some filesystem(such as chdfs) will throw {@code IOException} instead of {@code EOFException}. It will cause error in isBlockCorrupted(). + * Wrapped by {@code BoundedFsDataInputStream}, to check whether the desired offset is out of the file size in advance. + */ + private static FSDataInputStream wrapStreamByBoundedFsDataInputStream(FileSystem fs, Review Comment: If we call this method at line 490 above, we don't need lines 533 to 539, right? Essentially line 493 could be
```
return FSUtils.shouldWrappedByBoundedDataStream(fs)
    ? new BoundedFsDataInputStream(fs, logFile.getPath(), fsDataInputStream)
    : fsDataInputStream;
```
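The bounds-check idea behind `BoundedFsDataInputStream` can be sketched independently of Hadoop. This is a hypothetical minimal wrapper (the real class wraps an `FSDataInputStream` and obtains the length from the filesystem); it only illustrates normalizing the error to `EOFException`:

```java
import java.io.EOFException;
import java.io.IOException;

public class BoundedSeekSketch {
    private final long fileLength;
    private long pos = 0;

    public BoundedSeekSketch(long fileLength) {
        this.fileLength = fileLength;
    }

    // Check the requested offset against the known file length BEFORE seeking,
    // and raise EOFException ourselves. Filesystems such as chdfs would otherwise
    // surface a plain IOException, which isBlockCorrupted() does not recognize
    // as "seeked past end of file".
    public void seek(long offset) throws IOException {
        if (offset > fileLength) {
            throw new EOFException(
                "Attempted to seek past EOF: " + offset + " > " + fileLength);
        }
        pos = offset;
    }

    public long getPos() {
        return pos;
    }
}
```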
[GitHub] [hudi] santoshsb opened a new issue, #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
santoshsb opened a new issue, #5452: URL: https://github.com/apache/hudi/issues/5452 Hi Team, We are currently evaluating Hudi for our analytical use cases, and as part of this exercise we are facing a few issues with schema evolution and data loss. The current issue we have encountered is while updating a record. We have currently inserted a single record with the following schema:
```
root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
Now when we insert the new data with the following schema:
```
root
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
the update is successful, but the schema is missing the
```
 |-- maritalStatus: struct (nullable = true)
 |    |-- coding: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- code: string (nullable = true)
 |    |    |    |-- display: string (nullable = true)
 |    |    |    |-- system: string (nullable = true)
 |    |-- text: string (nullable = true)
```
field. Our expected behaviour was that after adding the second entry, the new column "multipleBirthBoolean" would be added to the overall schema, and the previous "maritalStatus" struct column would be retained and be null for the second entry. The final schema looks like this:
```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- birthDate: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- multipleBirthBoolean: boolean (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
Basically, when a new entry is added and it is missing a column from the destination schema, the update is successful and the missing column vanishes from the previous entries. Let us know if we are missing any configuration options. We cannot control the schema, as it is defined by FHIR standards (https://www.hl7.org/fhir/patient.html#resource); most of the fields here are optional, so the incoming data from our customers will be missing certain columns.

**Environment Description**

* Hudi version : 0.12.0-SNAPSHOT
* Spark version : 3.2.1
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : Local
* Running on Docker? (yes/no) : no

Thanks for the help.
[GitHub] [hudi] xiarixiaoyao closed issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
xiarixiaoyao closed issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert. URL: https://github.com/apache/hudi/issues/5452
[GitHub] [hudi] xiarixiaoyao commented on issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.
xiarixiaoyao commented on issue #5452: URL: https://github.com/apache/hudi/issues/5452#issuecomment-1240122505 @santoshsb you need to use schema evolution and hoodie.datasource.write.reconcile.schema; see the following code:
```
def perf(spark: SparkSession) = {
  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.functions._
  import org.apache.hudi.DataSourceWriteOptions
  import org.apache.hudi.DataSourceReadOptions
  import org.apache.hudi.config.HoodieWriteConfig
  import org.apache.hudi.hive.MultiPartKeysValueExtractor

  // Define a Patient FHIR resource; for simplicity most elements are deleted and a few retained
  val orgString = """{"resourceType":"Patient","id":"beca9a29-49bb-40e4-adff-4dbb4d664972","lastUpdated":"2022-02-14T15:18:18.90836+05:30","source":"4a0701fe-5c3b-482b-895d-875fcbd2148a","name":[{"use":"official","family":"Keeling57","given":["Serina556"],"prefix":["Ms."]}]}"""
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  val orgStringDf = spark.read.json(Seq(orgString).toDS)

  // Specify common DataSourceWriteOptions in the single hudiOptions variable
  val hudiOptions = Map[String, String](
    HoodieWriteConfig.TABLE_NAME -> "patient_hudi",
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "source",
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "lastUpdated",
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true")

  // Write the orgStringDf to a Hudi table
  orgStringDf.write
    .format("org.apache.hudi")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
    .options(hudiOptions)
    .mode(SaveMode.Overwrite)
    .save("/work/data/updateTst/hudi/json_schema_tst")

  // Read the Hudi table
  val patienthudi = spark.read.format("hudi").load("/work/data/updateTst/hudi/json_schema_tst")

  // Print schema
  patienthudi.printSchema

  // Update: based on our use case, add a new patient resource; this resource might contain
  // new columns and might not have existing columns (a normal use case with FHIR data)
  val updatedString = """{"resourceType":"Patient","id":"beca9a29-49bb-40e4-adff-4dbb4d664972","lastUpdated":"2022-02-14T15:18:18.90836+05:30","source":"4a0701fe-5c3b-482b-895d-875fcbd2148a","name":[{"use":"official","family":"Keeling57","given":["Serina556"]}]}"""

  // Convert the new resource string into a DF
  val updatedStringDf = spark.read.json(Seq(updatedString).toDS)

  // Check the schema of the new resource that is being added
  updatedStringDf.printSchema

  // Upsert the new resource
  spark.sql("set hoodie.schema.on.read.enable=true")
  updatedStringDf.write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .mode(SaveMode.Append)
    .save("/work/data/updateTst/hudi/json_schema_tst")

  // Read the Hudi table
  val patienthudiUpdated = spark.read.format("hudi").load("/work/data/updateTst/hudi/json_schema_tst")

  // Print the schema after adding the new record
  patienthudiUpdated.printSchema
}
```
patienthudiUpdated.schema:
```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: string (nullable = true)
 |-- lastUpdated: string (nullable = true)
 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- family: string (nullable = true)
 |    |    |-- given: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- prefix: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- use: string (nullable = true)
 |-- resourceType: string (nullable = true)
 |-- source: string (nullable = true)
```
I think it should be ok, thanks.
[GitHub] [hudi] xushiyan commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
xushiyan commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r965411452 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -399,9 +451,65 @@ protected void writeIncomingRecords() throws IOException { } } + protected SerializableRecord cdcRecord(HoodieCDCOperation operation, String recordKey, String partitionPath, + GenericRecord oldRecord, GenericRecord newRecord) { +GenericData.Record record; +if (cdcSupplementalLoggingMode.equals(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE_WITH_BEFORE_AFTER)) { + record = CDCUtils.cdcRecord(operation.getValue(), instantTime, Review Comment: can we prefix classes with `Hoodie`? like `HoodieCDCUtils` , which is the convention in the codebase ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -399,9 +451,65 @@ protected void writeIncomingRecords() throws IOException { } } + protected SerializableRecord cdcRecord(HoodieCDCOperation operation, String recordKey, String partitionPath, Review Comment: better name for this kind of method would be starting with `make` or `create`, easier to understand ## hudi-common/src/main/java/org/apache/hudi/common/table/cdc/CDCFileSplit.java: ## @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.cdc; + +import org.apache.hudi.common.model.FileSlice; +import org.apache.hudi.common.util.Option; + +import java.io.Serializable; + +/** + * This contains all the information that retrieve the change data at a single file group and + * at a single commit. + * + * For [[cdcFileType]] = [[CDCFileTypeEnum.ADD_BASE_FILE]], [[cdcFile]] is a current version of + * the base file in the group, and [[beforeFileSlice]] is None. + * For [[cdcFileType]] = [[CDCFileTypeEnum.REMOVE_BASE_FILE]], [[cdcFile]] is null, + * [[beforeFileSlice]] is the previous version of the base file in the group. + * For [[cdcFileType]] = [[CDCFileTypeEnum.CDC_LOG_FILE]], [[cdcFile]] is a log file with cdc blocks. + * when enable the supplemental logging, both [[beforeFileSlice]] and [[afterFileSlice]] are None, + * otherwise these two are the previous and current version of the base file. + * For [[cdcFileType]] = [[CDCFileTypeEnum.MOR_LOG_FILE]], [[cdcFile]] is a normal log file and + * [[beforeFileSlice]] is the previous version of the file slice. + * For [[cdcFileType]] = [[CDCFileTypeEnum.REPLACED_FILE_GROUP]], [[cdcFile]] is null, + * [[beforeFileSlice]] is the current version of the file slice. + */ +public class CDCFileSplit implements Serializable { Review Comment: HoodieCDCFileSplit ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieCDCDataBlock.java: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.log.block; + +import org.apache.avro.Schema; +import org.apache.avro.generic.IndexedRecord; + +import org.apache.hadoop.fs.FSDataInputStream; + +import org.apache.hudi.common.util.Option; + +import javax.annotation.Nonnull; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +public class HoodieCDCDataBlock extends HoodieAvroDataBlock { + + public HoodieCDCDataBlock( +
[GitHub] [hudi] nsivabalan commented on pull request #5406: [HUDI-3954] Don't keep the last commit before the earliest commit to retain
nsivabalan commented on PR #5406: URL: https://github.com/apache/hudi/pull/5406#issuecomment-1240120146 Hey @danny0405: maybe there is some rationale behind the original intent. It's just deducting 1 commit from what the user wants, right? As of now, I don't feel this is giving us much or fixing any regression. Can we drop the patch?
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965426924 ## rfc/rfc-43/rfc-43.md: ## @@ -0,0 +1,316 @@ + + +# RFC-43: Implement Table Management ServiceTable Management Service for Hudi + +## Proposers + +- @yuzhaojing + +## Approvers + +- @vinothchandar +- @Raymond + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016) + +## Abstract + +Hudi table needs table management operations. Currently, schedule these job provides Three ways: + +- Inline, execute these job and writing job in the same application, perform the these job and writing job serially. + +- Async, execute these job and writing job in the same application, Async parallel execution of these job and write job. + +- Independent compaction/clustering job, execute an async compaction/clustering job of another application. + +With the increase in the number of HUDI tables, due to a lack of management capabilities, maintenance costs will become +higher. This proposal is to implement an independent compaction/clustering Service to manage the Hudi +compaction/clustering job. + +## Background + +In the current implementation, if the HUDI table needs do compact/cluster, it only has three ways: + +1. Use inline compaction/clustering, in this mode the job will be block writing job. + +2. Using Async compaction/clustering, in this mode the job execute async but also sharing the resource with HUDI to + write a job that may affect the stability of job writing, which is not what the user wants to see. + +3. Using independent compaction/clustering job is a better way to schedule the job, in this mode the job execute async + and do not sharing resources with writing job, but also has some questions: +1. Users have to enable lock service providers so that there is not data loss. 
Especially when compaction/clustering + is getting scheduled, no other writes should proceed concurrently and hence a lock is required. +2. The user needs to manually start an async compaction/clustering application, which means that the user needs to + maintain two jobs. +3. With the increase in the number of HUDI jobs, there is no unified service to manage compaction/clustering jobs ( + monitor, retry, history, etc...), which will make maintenance costs increase. + +With this effort, we want to provide an independent compaction/clustering Service, it will have these abilities: + +- Provides a pluggable execution interface that can adapt to multiple execution engines, such as Spark and Flink. + +- With the ability to failover, need to be persisted compaction/clustering message. + +- Perfect metrics and reuse HoodieMetric expose to the outside. + +- Provide automatic failure retry for compaction/clustering job. + +## Implementation + +### Processing mode +Different processing modes depending on whether the meta server is enabled + +- Enable meta server +- The pull-based mechanism works for fewer tables. Scanning 1000s of tables for possible services is going to induce lots of a load of listing. +- The meta server provides a listener that takes as input the uris of the Table Management Service and triggers a callback through the hook at each instant commit, thereby calling the Table Management Service to do the scheduling/execution for the table. +![](service_with_meta_server.png) + +- Do not enable meta server +- for every write/commit on the table, the table management server is notified. 
+ We can set a heartbeat timeout for each hoodie table, and if it exceeds it, we will actively pull it once to prevent the commit request from being lost +![](service_without_meta_server.png) + +### Processing flow + +- After receiving the request, the table management server schedules the relevant table service to the table's timeline +- Persist each table service into an instance table of Table Management Service +- notify a separate execution component/thread can start executing it +- Monitor task execution status, update table information, and retry failed table services up to the maximum number of times + +### Storage + +- There are two types of stored information +- Register with the hoodie table of the Table Management Service +- Each table service instance is generated by Table Management Service + + Lectotype + +**Requirements:** support single row ACID transactions. Almost all write operations require it, like operation creation, +status changing and so on. + +There are the candidates, + +**Hudi table** + +pros: + +- No external components are introduced and maintained. + +crons: + +- Each write to hudi table will be a deltacommit, this will further lower the number of possible requests / sec that can + be served. + +**RDBMS** + +pros: + +- database that is suitable for structured data like metadata to store. + +- can describe the relation between many
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965424691 ## rfc/rfc-43/rfc-43.md: ## @@ -0,0 +1,316 @@

# RFC-43: Table Management Service for Hudi

## Proposers

- @yuzhaojing

## Approvers

- @vinothchandar
- @Raymond

## Status

JIRA: [https://issues.apache.org/jira/browse/HUDI-3016](https://issues.apache.org/jira/browse/HUDI-3016)

## Abstract

Hudi tables need table management operations. Currently, there are three ways to schedule these jobs:

- Inline: run the table service and the writing job in the same application, executing them serially.

- Async: run the table service and the writing job in the same application, executing them in parallel.

- Independent compaction/clustering job: run an async compaction/clustering job in a separate application.

As the number of Hudi tables grows, the lack of management capabilities pushes maintenance costs higher. This proposal implements an independent compaction/clustering service to manage Hudi compaction/clustering jobs.

## Background

In the current implementation, a Hudi table that needs compaction/clustering has only three options:

1. Inline compaction/clustering: the table service blocks the writing job.

2. Async compaction/clustering: the table service executes asynchronously but shares resources with the writing job, which may affect write stability; this is not what users want.

3. An independent compaction/clustering job: the better way to schedule the service, since it executes asynchronously and does not share resources with the writing job, but it still has some problems:
   1. Users have to enable a lock provider so that there is no data loss. Especially while compaction/clustering is being scheduled, no other writes should proceed concurrently, hence a lock is required.
   2. Users must manually start a separate async compaction/clustering application, which means maintaining two jobs.
   3. As the number of Hudi jobs grows, there is no unified service to manage compaction/clustering jobs (monitoring, retries, history, etc.), so maintenance costs keep increasing.

With this effort, we want to provide an independent compaction/clustering service with these abilities:

- Provide a pluggable execution interface that can adapt to multiple execution engines, such as Spark and Flink.

- Support failover, which requires persisting compaction/clustering messages.

- Expose thorough metrics to the outside, reusing HoodieMetrics.

- Automatically retry failed compaction/clustering jobs.

## Implementation

### Processing mode

The processing mode differs depending on whether the meta server is enabled.

- Meta server enabled
  - A pull-based mechanism only works for a small number of tables; scanning thousands of tables for runnable services induces a heavy listing load.
  - The meta server therefore provides a listener that takes the URIs of the Table Management Service as input and triggers a callback through a hook at each instant commit, calling the Table Management Service to do the scheduling/execution for the table.

![](service_with_meta_server.png)

- Meta server not enabled
  - For every write/commit on the table, the table management server is notified. We can also set a heartbeat timeout per Hudi table; if it is exceeded, we actively pull once so that a lost commit request does not go unnoticed.

![](service_without_meta_server.png)

### Processing flow

- After receiving the request, the table management server schedules the relevant table service onto the table's timeline.
- Each table service is persisted into an instance table of the Table Management Service.
- A separate execution component/thread is notified that it can start executing.
- Task execution status is monitored, table information is updated, and failed table services are retried up to the maximum number of times.

### Storage

Two types of information are stored:

- Registrations of Hudi tables with the Table Management Service.
- The table service instances generated by the Table Management Service.

#### Selection

**Requirements:** support single-row ACID transactions. Almost all write operations require this, such as operation creation and status changes.

The candidates are:

**Hudi table**

Pros:

- No external components are introduced or maintained.

Cons:

- Each write to the Hudi table is a deltacommit, which further lowers the number of requests/sec that can be served.

**RDBMS**

Pros:

- A database is well suited to storing structured data such as this metadata.

- Can describe the relation between many
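The pluggable execution interface called for in the RFC (adapting to multiple engines such as Spark and Flink) could be sketched roughly as follows. All names here (`TableServiceExecutor`, `TableManagementServiceSketch`, the paths and instant times) are illustrative assumptions, not actual Hudi APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-method interface an engine implements to run one
// scheduled table service (compaction/clustering) instant.
interface TableServiceExecutor {
    boolean execute(String tableBasePath, String instantTime);
}

public class TableManagementServiceSketch {
    private final Map<String, TableServiceExecutor> executors = new HashMap<>();

    // Engines (Spark, Flink, ...) register an executor implementation.
    void register(String engine, TableServiceExecutor executor) {
        executors.put(engine, executor);
    }

    // The service dispatches a scheduled instant to the matching engine.
    boolean dispatch(String engine, String tableBasePath, String instantTime) {
        TableServiceExecutor executor = executors.get(engine);
        if (executor == null) {
            throw new IllegalArgumentException("No executor registered for engine: " + engine);
        }
        return executor.execute(tableBasePath, instantTime);
    }

    public static void main(String[] args) {
        TableManagementServiceSketch service = new TableManagementServiceSketch();
        // A toy "flink" executor that always succeeds, standing in for a real job launcher.
        service.register("flink", (table, instant) -> true);
        System.out.println(service.dispatch("flink", "s3://bucket/tbl", "20220908120000")); // true
    }
}
```

Because the interface has a single method, each engine adapter can be supplied as a lambda or as a full class wrapping its job-submission client.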
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965424454 ## rfc/rfc-43/rfc-43.md
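The meta-server hook discussed in this review thread (a listener registered per Table Management Service URI, fired at each instant commit) could be sketched in-process like this. Everything here is an illustrative assumption, including the class names and the URI; a real implementation would push over RPC/HTTP rather than call a local lambda:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the meta-server commit hook; not actual Hudi classes.
public class CommitHookSketch {
    private final List<Consumer<String>> commitListeners = new ArrayList<>();

    // Register a callback for a Table Management Service; a real meta server
    // would remember tmsUri and deliver notifications over the network.
    void registerListener(String tmsUri, Consumer<String> onCommit) {
        commitListeners.add(onCommit);
    }

    // Called after an instant commits on the timeline: notify every listener
    // so the TMS can decide whether to schedule compaction/clustering.
    int completeInstant(String instantTime) {
        for (Consumer<String> listener : commitListeners) {
            listener.accept(instantTime);
        }
        return commitListeners.size(); // number of TMS callbacks fired
    }

    public static void main(String[] args) {
        CommitHookSketch metaServer = new CommitHookSketch();
        List<String> seen = new ArrayList<>();
        metaServer.registerListener("http://tms.example:26755", seen::add);
        metaServer.completeInstant("20220908093000");
        System.out.println(seen); // [20220908093000]
    }
}
```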
[GitHub] [hudi] nsivabalan commented on issue #6552: [SUPPORT] AWSDmsAvroPayload does not work correctly with any version above 0.10.0
nsivabalan commented on issue #6552: URL: https://github.com/apache/hudi/issues/6552#issuecomment-1240113529 yeah. Udit pointed out the right commit. here is the fix that worked out for me locally.

```
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java b/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
index 20a20fb629..a3c6dde99e 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java
@@ -69,21 +69,21 @@ public class AWSDmsAvroPayload extends OverwriteWithLatestAvroPayload {

   @Override
   public Option<IndexedRecord> getInsertValue(Schema schema, Properties properties) throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema, properties).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema, properties);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }

   @Override
   public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }

   @Override
   public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Properties properties)
       throws IOException {
-    IndexedRecord insertValue = super.getInsertValue(schema, properties).get();
-    return handleDeleteOperation(insertValue);
+    Option<IndexedRecord> insertValue = super.getInsertValue(schema, properties);
+    return insertValue.isPresent() ? handleDeleteOperation(insertValue.get()) : insertValue;
   }
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
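The shape of that fix — guard the `Option` before post-processing instead of calling `.get()` unconditionally — can be sketched with `java.util.Optional` standing in for Hudi's `Option`. The names below are illustrative, not the actual payload class:

```java
import java.util.Optional;

public class DeleteGuardSketch {
    // Stand-in for super.getInsertValue(...): an empty Optional models a record
    // the base payload already treats as a delete (no value to return).
    static Optional<String> baseInsertValue(boolean deleted) {
        return deleted ? Optional.empty() : Optional.of("record");
    }

    // Mirrors the patched methods: only run handleDeleteOperation-style
    // post-processing when a value is actually present. The buggy version
    // called .get() first, which throws NoSuchElementException on deletes.
    static Optional<String> getInsertValue(boolean deleted) {
        Optional<String> insertValue = baseInsertValue(deleted);
        return insertValue.isPresent()
            ? insertValue.map(v -> v + ":checked") // stand-in for handleDeleteOperation
            : insertValue;
    }

    public static void main(String[] args) {
        System.out.println(getInsertValue(false).orElse("<deleted>")); // record:checked
        System.out.println(getInsertValue(true).orElse("<deleted>"));  // <deleted>
    }
}
```

The key point is that the empty case is propagated as-is rather than dereferenced, which is exactly what the three patched methods above now do.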
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965423222 ## rfc/rfc-43/rfc-43.md (on the "Lectotype" heading in the Storage section) Review Comment: Will update it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
danny0405 commented on code in PR #6628: URL: https://github.com/apache/hudi/pull/6628#discussion_r965422148 ## packaging/hudi-flink-bundle/pom.xml:

@@ -501,8 +501,7 @@
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
-
-      <version>1.10.0</version>
+      <version>${avro.version}</version>

Review Comment: Make sure that there are no compatibility issues in flink side.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
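For context, the change relies on the standard Maven convention of defining a version once as a property in the root pom and interpolating it in child modules. The fragment below is a hedged sketch of that convention with an illustrative version number, not the exact Hudi pom contents:

```xml
<!-- Root pom (sketch): declare the Avro version once as a property. -->
<properties>
  <avro.version>1.10.0</avro.version> <!-- illustrative value -->
</properties>

<!-- Any module or bundle pom: reference the property instead of hard-coding. -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
</dependency>
```

This keeps every bundle on the same Avro version and makes an upgrade a one-line change in the root pom.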
[GitHub] [hudi] hudi-bot commented on pull request #6548: [HUDI-4749] Fixing full cleaning to leverage metadata table
hudi-bot commented on PR #6548: URL: https://github.com/apache/hudi/pull/6548#issuecomment-1240110458 ## CI report: * 78d0f8bb6487e55b91443dcade5285e4a2412e3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11229) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-4766: - Fix Version/s: 0.12.1

> Fix HoodieFlinkClusteringJob
>
> Key: HUDI-4766
> URL: https://issues.apache.org/jira/browse/HUDI-4766
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.12.1
>
> h1. Flink Hudi Clustering Issues
>
> # Integer type used for byte-size configuration parameters instead of long
> ** Maximum representable size is 2^31-1 bytes (~2 gigabytes)
> # Unable to choose a particular instant to execute
> # Unable to select the filter mode, as the method that controls it is overridden by _FlinkSizeBasedClusteringPlanStrategy#filterPartitionPaths_
> # No cleaning
> ** With reference to offline compaction (HoodieFlinkCompactor), cleaning is only enabled if _clean.async.enabled = false_
> # Schedule configuration is not consistent with HoodieFlinkCompactor, defining the flag as false, which is the opposite of HoodieFlinkCompactor
> # No ability to pass props in using _--props/--hoodie-conf_
> ** Required for passing in configurations like:
> *** _hoodie.parquet.compression.ratio_
> *** Partition filter configurations depending on the strategy
> # Clustering groups will spit out files of _hoodie.parquet.max.file.size_ (120MB by default)
> # Multiple clustering jobs can execute, but there is no fine-grained control over restarting jobs that have failed. The current implementation only filters for REQUESTED clustering jobs; rollbacks will never be performed.
> # Removed unused _getNumberOfOutputFileGroups()_ function.
> ** _hoodie.clustering.plan.strategy.small.file.limit_
> ** _hoodie.clustering.plan.strategy.max.bytes.per.group_
> ** _hoodie.clustering.plan.strategy.target.file.max.bytes_
> ** Will create N file groups (1 task will be writing to each file group, increasing parallelism)

-- This message was sent by Atlassian Jira (v8.20.10#820010)
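The first issue in HUDI-4766 (int-typed byte-size configuration parameters) is easy to demonstrate: any size of 2^31 bytes or more cannot be represented as an `int`, and a narrowing cast silently corrupts it. A minimal illustration:

```java
public class ByteSizeOverflow {
    public static void main(String[] args) {
        // An int caps byte-size configs at 2^31 - 1 bytes, i.e. just under 2 GiB.
        long fourGiB = 4L * 1024 * 1024 * 1024;          // 4294967296 bytes
        System.out.println(fourGiB > Integer.MAX_VALUE); // true: out of int range
        // Narrowing to int keeps only the low 32 bits: 2^32 truncates to 0.
        System.out.println((int) fourGiB);               // 0
        // A long-typed option represents the same size exactly.
        System.out.println(fourGiB);                     // 4294967296
    }
}
```

This is why byte-size options such as file-size limits need to be declared as `long` rather than `int`.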
[jira] [Resolved] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen resolved HUDI-4766. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-4766) Fix HoodieFlinkClusteringJob
[ https://issues.apache.org/jira/browse/HUDI-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601571#comment-17601571 ] Danny Chen commented on HUDI-4766: Fixed via master branch: adf36093d2454c7e3cd7090a0cb3fd5af140b919 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yuzhaojing commented on a diff in pull request #4309: [HUDI-3016][RFC-43] Proposal to implement Table Management Service
yuzhaojing commented on code in PR #4309: URL: https://github.com/apache/hudi/pull/4309#discussion_r965421349 ## rfc/rfc-43/rfc-43.md (on the Processing flow line: "After receiving the request, the table management server schedules the relevant table service to the table's timeline") Review Comment: I mean scheduling the corresponding table service to the hudi table's timeline on storage via TMS -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
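The RFC's processing flow ends with monitoring execution status and retrying failed table services up to a maximum count. That retry step could be sketched like this; the names are hypothetical, not Hudi code:

```java
import java.util.concurrent.Callable;

public class TableServiceRetry {
    // Runs one table service, retrying failures up to maxAttempts times,
    // mirroring the RFC's "retry failed table services up to the maximum
    // number of times" step. Names are illustrative, not actual Hudi APIs.
    static boolean runWithRetries(Callable<Boolean> service, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (service.call()) {
                    return true;             // success: stop retrying
                }
            } catch (Exception e) {
                // A real service would log, update instance state, and back off here.
            }
        }
        return false;                        // exhausted attempts: mark FAILED
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Toy service that fails twice, then succeeds on the third attempt.
        boolean ok = runWithRetries(() -> ++calls[0] >= 3, 5);
        System.out.println(ok + " after " + calls[0] + " attempts"); // true after 3 attempts
    }
}
```

In the proposed service, the failure terminal state (after exhausting attempts) is what gets surfaced through metrics and history, rather than being silently dropped.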
[hudi] branch master updated (e8aee84c7c -> adf36093d2)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

from e8aee84c7c [HUDI-4793] Fixing ScalaTest tests to properly respect Log4j2 configs (#6617)
 add adf36093d2 [HUDI-4766] Strengthen flink clustering job (#6566)

No new revisions were added by this update.

Summary of changes:
 .../FlinkSizeBasedClusteringPlanStrategy.java      |  8 ---
 .../apache/hudi/configuration/FlinkOptions.java    | 12 ++---
 .../hudi/sink/clustering/ClusteringCommitSink.java |  7 +++
 .../hudi/sink/clustering/ClusteringOperator.java   |  5 ++
 .../sink/clustering/FlinkClusteringConfig.java     | 61 ++
 .../sink/clustering/HoodieFlinkClusteringJob.java  | 60 +
 .../hudi/sink/compact/HoodieFlinkCompactor.java    | 16 ++
 .../java/org/apache/hudi/util/StreamerUtil.java    |  4 +-
 8 files changed, 136 insertions(+), 37 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #6566: [HUDI-4766] Strengthen flink clustering job
danny0405 merged PR #6566: URL: https://github.com/apache/hudi/pull/6566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240107831 ## CI report: * 5f385a174df1fa344b87a3a4ada3f3f6d61f1d76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11225) * a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11233) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6625: [HUDI-4799] improve analyzer exception tip when can not resolve expre…
hudi-bot commented on PR #6625: URL: https://github.com/apache/hudi/pull/6625#issuecomment-1240104388 ## CI report: * 5f385a174df1fa344b87a3a4ada3f3f6d61f1d76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11225) * a6d1f537e3a4fee7b9fb913de0ab531fc8d4be83 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6575: [HUDI-4754] Add compliance check in github actions
hudi-bot commented on PR #6575: URL: https://github.com/apache/hudi/pull/6575#issuecomment-1240093961

## CI report:

* 1600e31836157c8d05e3bc8b9e08e1717471f1a6 UNKNOWN
* 4d02f2c64a5fc4b89889677ee639a20b53cec26a UNKNOWN
* 48147d19c835e7868102fd2d083659e6ee2ac343 UNKNOWN
* c644d730766bcb7c00c7b427d73e56a1c63dbb3a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11228)
[GitHub] [hudi] xushiyan commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi
xushiyan commented on code in PR #6476: URL: https://github.com/apache/hudi/pull/6476#discussion_r956951504

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java:

```diff
@@ -116,12 +125,18 @@ public List close() {
       String key = newRecordKeysSorted.poll();
       HoodieRecord hoodieRecord = keyToNewRecords.get(key);
       if (!writtenRecordKeys.contains(hoodieRecord.getRecordKey())) {
+        Option insertRecord;
         if (useWriterSchemaForCompaction) {
-          writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps()));
+          insertRecord = hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, config.getProps());
         } else {
-          writeRecord(hoodieRecord, hoodieRecord.getData().getInsertValue(tableSchema, config.getProps()));
+          insertRecord = hoodieRecord.getData().getInsertValue(tableSchema, config.getProps());
         }
+        writeRecord(hoodieRecord, insertRecord);
```

Review Comment: ditto
[jira] [Commented] (HUDI-4485) Hudi cli got empty result for command show fsview all
[ https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17601562#comment-17601562 ]

Yao Zhang commented on HUDI-4485:
---------------------------------

Hi [~codope], finally all unit test issues have been resolved and CI passed. Could you please help review this PR? Thank you very much.

> Hudi cli got empty result for command show fsview all
> -----------------------------------------------------
>
>                 Key: HUDI-4485
>                 URL: https://issues.apache.org/jira/browse/HUDI-4485
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.11.1
>         Environment: Hudi version : 0.11.1
>                      Spark version : 3.1.1
>                      Hive version : 3.1.0
>                      Hadoop version : 3.1.1
>            Reporter: Yao Zhang
>            Assignee: Yao Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empty result for command show fsview all · Issue #6177 · apache/hudi (github.com)|https://github.com/apache/hudi/issues/6177]
>
> *Describe the problem you faced*
>
> Hudi cli returned an empty result after running the command `show fsview all`.
> (screenshot: https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)
> The type of table t1 is COW, and I am sure that the parquet files are actually generated inside the data folder. Also, the parquet files are not damaged, as the data could be retrieved correctly either by reading it as a Hudi table or by directly reading each parquet file (using Spark).
>
> *To Reproduce*
>
> Steps to reproduce the behavior:
> 1. Enter the Flink SQL client.
> 2. Execute the SQL below and check that the data was written successfully.
> ```sql
> CREATE TABLE t1(
>   uuid VARCHAR(20),
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///path/to/table/',
>   'table.type' = 'COPY_ON_WRITE'
> );
>
> -- insert data using values
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter Hudi cli and execute `show fsview all`.
>
> *Expected behavior*
>
> `show fsview all` in Hudi cli should return all file slices.
>
> *Environment Description*
>
> * Hudi version : 0.11.1
> * Spark version : 3.1.1
> * Hive version : 3.1.0
> * Hadoop version : 3.1.1
> * Storage (HDFS/S3/GCS..) : HDFS
> * Running on Docker? (yes/no) : no
>
> *Additional context*
>
> No.
>
> *Stacktrace*
>
> N/A
>
> Temporary solution:
> I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
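For readers following the reproduction above, step 3 can be sketched as the Hudi CLI session below. This is a sketch rather than a transcript from the issue: the launch script path and prompt text vary by packaging, and the table path is the same placeholder used in the Flink DDL.

```
# Launch the Hudi CLI (script location depends on how Hudi was built/packaged)
./hudi-cli/hudi-cli.sh

# Point the CLI at the table written by the Flink job, then list file slices
hudi-> connect --path hdfs:///path/to/table/
hudi:table-> show fsview all
```

When the bug reproduces, `show fsview all` prints an empty result even though parquet files exist under the partition directories; after the spring-shell replacement described in the issue, the command returns the expected file-slice listing.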
[GitHub] [hudi] paul8263 commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …
paul8263 commented on PR #6489: URL: https://github.com/apache/hudi/pull/6489#issuecomment-1240058067

> Hi @codope and @yihua, errors of hudi-integ-test are almost cleared. The only one left is:
>
> org.apache.hudi.integ.command.ITTestHoodieSyncCommand.testValidateSync(ITTestHoodieSyncCommand.java:56)
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11183=logs=3b6e910d-b98f-5de6-b9cb-1e5ff571f5de=30b5aae4-0ea0-5566-42d0-febf71a7061a=146906
>
> Is there a way to view the detailed error log in the docker container via Azure?

Finally, all test failures have been resolved.
[GitHub] [hudi] hudi-bot commented on pull request #6520: [HUDI-4726] Incremental input splits result is not as expected when f…
hudi-bot commented on PR #6520: URL: https://github.com/apache/hudi/pull/6520#issuecomment-1240042837

## CI report:

* e55d28bdafa64d4a5180fd46191a420e702a58dc UNKNOWN
* 360821a2d0110a82ac3c56eb65bcc3ad9b9659bf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11227)
[GitHub] [hudi] hudi-bot commented on pull request #6628: [HUDI-4806] Use Avro version from the root pom for Flink bundle
hudi-bot commented on PR #6628: URL: https://github.com/apache/hudi/pull/6628#issuecomment-1240039267

## CI report:

* 2504fd6b17a7a3fb2a77f755d7fe6b6c7f83c96f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11232)