[GitHub] [hudi] xicm commented on issue #6414: [SUPPORT] Spark3 with Hadoop3 using metadata could have compatible issue when reading hfile

2022-08-16 Thread GitBox


xicm commented on issue #6414:
URL: https://github.com/apache/hudi/issues/6414#issuecomment-1217539046

   https://github.com/apache/hudi/issues/5765#issuecomment-1179915480
   Following this, I ran `mvn install` for HBase built against Hadoop 3 and then packaged Hudi; the problem has been resolved.
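   A rough sketch of those build steps follows; the HBase version tag, the `-Dhadoop.profile=3.0` flag, and the Hudi packaging options are assumptions and may need to be adjusted to your environment and Hudi version:
   
   ```sh
   # Rebuild HBase against Hadoop 3 and install it into the local Maven repository
   # (version tag and hadoop profile below are assumptions)
   git clone https://github.com/apache/hbase.git
   cd hbase
   git checkout rel/2.4.9
   mvn clean install -DskipTests -Dhadoop.profile=3.0
   
   # Then package Hudi so its bundles pick up the locally built HBase artifacts
   cd ../hudi
   mvn clean package -DskipTests
   ```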


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0

2022-08-16 Thread GitBox


xushiyan commented on code in PR #6417:
URL: https://github.com/apache/hudi/pull/6417#discussion_r947510396


##
website/releases/release-0.12.0.md:
##
@@ -0,0 +1,132 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) 
([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can leverage the native Hudi connector to query Hudi tables.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please check out the [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive a Hudi table beyond the savepoint commit by enabling the `hoodie.archive.beyond.savepoint` write
+configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years by adding
+one savepoint per day for older commits (say > 30 days old) and query the old data using `as.of.instant` semantics.
+In previous versions, one would have to retain every commit and let archival stop at the first savepointed commit.
+
+:::note
+However, if this feature is enabled, restore cannot be supported. This limitation will be relaxed in a future release,
+and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500).
+:::
+
+### Post Write Termination Strategy with Deltastreamer
+
+Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance,
+users can configure a graceful shutdown when there is no new data from the source for 5 consecutive sync rounds. Here is the interface
+for the termination strategy.
+```java
+/**
+ * Post write termination strategy for deltastreamer in continuous mode.
+ */
+public interface PostWriteTerminationStrategy {
+
+  /**
+   * Returns whether deltastreamer needs to be shutdown.
+   * @param scheduledCompactionInstantAndWriteStatuses optional pair of 
scheduled compaction instant and write statuses.
+   * @return true if deltastreamer has to be shutdown. false otherwise.
+   */
+  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);
+
+}
+```
+
+Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large
+cluster for a large input of data, one could start deltastreamer in continuous mode and add a shutdown strategy to
+terminate once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster
+to bootstrap data. We have one concrete implementation out of the box, [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java).
+Users are free to implement their own strategy as they see fit.
+
+### Performance Improvements
+
+While some performance regressions were addressed in [0.11.1 
release](/releases/release-0.11.1#addressing-performance-regression-in-0110), 
+this version brings more improvements to make Hudi the most performant lake 
storage format. Some notable improvements are:
+- Closed the performance gap in writing through Spark datasource vs sql. 
Previously, datasource writes were faster.
+- All built-in key generators implement more performant Spark-specific APIs.
+- Replaced UDF in bulk insert operation with RDD transformation to cut down 
serde cost.
+- Optimized column stats index performance in data skipping.
+
+We recently benchmarked Hudi against the TPC-DS workload.
+Please check out [our blog](/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks) for more details.
+
+### SQL Support Improvements
+
+- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through the `Call Procedure` command.
+- Support for `analyze table`.
+- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.
+
+### Flink Integration Improvements
+
+- Flink 1.15.x is integrated; use the profile param `-Pflink1.15` when compiling to adapt to the version. Alternatively, use `hudi-flink1.15-bundle`.
+- Data skipping is supported for batch mode read; set the SQL options `metadata.enabled`, `hoodie.metadata.index.column.stats.enable` and `read.data.skipping.enabled` to true to enable it.
+- An HMS-based Flink catalog is added with the catalog identifier `hudi`. You can instantiate the catalog through the API directly or use the `CREATE CATALOG` syntax to create it. Specify the catalog option `'mode' = 'hms'` to switch to the HMS catalog. By default, the catalog is in `dfs` mode.
+- Async clustering is supported for Flink `INSERT` operation, set up SQL 
option `clustering.sc

[GitHub] [hudi] xushiyan commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0

2022-08-16 Thread GitBox


xushiyan commented on code in PR #6417:
URL: https://github.com/apache/hudi/pull/6417#discussion_r947509056


##
website/releases/release-0.12.0.md:
##
@@ -0,0 +1,132 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) 
([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can leverage the native Hudi connector to query Hudi tables.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please check out the [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive a Hudi table beyond the savepoint commit by enabling the `hoodie.archive.beyond.savepoint` write
+configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years by adding
+one savepoint per day for older commits (say > 30 days old) and query the old data using `as.of.instant` semantics.
+In previous versions, one would have to retain every commit and let archival stop at the first savepointed commit.
+
+:::note
+However, if this feature is enabled, restore cannot be supported. This limitation will be relaxed in a future release,
+and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500).
+:::
+
+### Post Write Termination Strategy with Deltastreamer
+
+Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance,
+users can configure a graceful shutdown when there is no new data from the source for 5 consecutive sync rounds. Here is the interface
+for the termination strategy.
+```java
+/**
+ * Post write termination strategy for deltastreamer in continuous mode.
+ */
+public interface PostWriteTerminationStrategy {
+
+  /**
+   * Returns whether deltastreamer needs to be shutdown.
+   * @param scheduledCompactionInstantAndWriteStatuses optional pair of 
scheduled compaction instant and write statuses.
+   * @return true if deltastreamer has to be shutdown. false otherwise.
+   */
+  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);
+
+}
+```
+
+Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large
+cluster for a large input of data, one could start deltastreamer in continuous mode and add a shutdown strategy to
+terminate once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster
+to bootstrap data. We have one concrete implementation out of the box, [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java).
+Users are free to implement their own strategy as they see fit.
+
+### Performance Improvements
+
+While some performance regressions were addressed in [0.11.1 
release](/releases/release-0.11.1#addressing-performance-regression-in-0110), 
+this version brings more improvements to make Hudi the most performant lake 
storage format. Some notable improvements are:
+- Closed the performance gap in writing through Spark datasource vs sql. 
Previously, datasource writes were faster.
+- All built-in key generators implement more performant Spark-specific APIs.
+- Replaced UDF in bulk insert operation with RDD transformation to cut down 
serde cost.
+- Optimized column stats index performance in data skipping.
+
+We recently benchmarked Hudi against the TPC-DS workload.
+Please check out [our blog](/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks) for more details.
+
+### SQL Support Improvements
+
+- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through the `Call Procedure` command.
+- Support for `analyze table`.
+- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.
+
+### Flink Integration Improvements
+
+- Flink 1.15.x is integrated; use the profile param `-Pflink1.15` when compiling to adapt to the version. Alternatively, use `hudi-flink1.15-bundle`.

Review Comment:
   i'd put spark 3.3 support and flink 1.15 support in early sections since 
these are big features to highlight. in general i'd rank features from major 
impact to minor



##
website/releases/release-0.12.0.md:
##
@@ -0,0 +1,132 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) 
([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since versio

[jira] [Updated] (HUDI-3654) Support basic actions based on hudi metastore server

2022-08-16 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3654:
-
Sprint: 2022/08/22  (was: 2022/08/08)

> Support basic actions based on hudi metastore server 
> -
>
> Key: HUDI-3654
> URL: https://issues.apache.org/jira/browse/HUDI-3654
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4549) hive sync bundle causes class loader issue

2022-08-16 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4549:
-
Epic Link: HUDI-3529

> hive sync bundle causes class loader issue
> --
>
> Key: HUDI-4549
> URL: https://issues.apache.org/jira/browse/HUDI-4549
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.12.1
>
>
> A weird classpath issue I found: when testing deltastreamer using 
> hudi-utilities-slim-bundle, if I put --jars 
> hudi-hive-sync-bundle.jar,hudi-spark-bundle.jar then I'll get this error when 
> writing
> {code:java}
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hudi.avro.MercifulJsonConverter.convert(Ljava/lang/String;Lorg/apache/avro/Schema;)Lorg/apache/avro/generic/GenericRecord;
>   at 
> org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJson(AvroConvertor.java:86)
>   at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
> {code}
> If I put the spark bundle before the hive sync bundle, then there is no issue. Without 
> hive-sync-bundle, also no issue. So hive-sync-bundle somehow messes up the 
> classpath? Not sure why it reports a hudi-common API not found… caused by 
> shading avro?
> The same behavior I observed with aws-bundle, which makes sense, as it's a 
> superset of hive-sync-bundle
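
As a minimal sketch of the jar ordering described above (paths, versions, and the remaining options are placeholders, not the exact command used):

{code}
# Listing the spark bundle before the hive sync bundle avoided the NoSuchMethodError
spark-submit \
  --jars hudi-spark-bundle.jar,hudi-hive-sync-bundle.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-slim-bundle.jar \
  ...
{code}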



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4632) Remove the force active property for flink1.14 profile

2022-08-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4632.
--

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4632) Remove the force active property for flink1.14 profile

2022-08-16 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580613#comment-17580613
 ] 

Danny Chen commented on HUDI-4632:
--

Fixed via master branch: 9055b2f65d670c2b153c7492ca323188b84c3362

> Remove the force active property for flink1.14 profile
> --
>
> Key: HUDI-4632
> URL: https://issues.apache.org/jira/browse/HUDI-4632
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.11.1
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-4632] Remove the force active property for flink1.14 profile (#6415)

2022-08-16 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9055b2f65d [HUDI-4632] Remove the force active property for flink1.14 
profile (#6415)
9055b2f65d is described below

commit 9055b2f65d670c2b153c7492ca323188b84c3362
Author: Danny Chan 
AuthorDate: Wed Aug 17 14:22:10 2022 +0800

[HUDI-4632] Remove the force active property for flink1.14 profile (#6415)
---
 pom.xml | 2 --
 1 file changed, 2 deletions(-)

diff --git a/pom.xml b/pom.xml
index 54be08f6c0..66a9335937 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1860,8 +1860,6 @@
 true
 
   flink1.14
-  
-  !disabled
 
   
 



[GitHub] [hudi] danny0405 merged pull request #6415: [HUDI-4632] Remove the force active property for flink1.14 profile

2022-08-16 Thread GitBox


danny0405 merged PR #6415:
URL: https://github.com/apache/hudi/pull/6415


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-4529) Tweak some default config options for flink

2022-08-16 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4529.
--

> Tweak some default config options for flink
> ---
>
> Key: HUDI-4529
> URL: https://issues.apache.org/jira/browse/HUDI-4529
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-4529) Tweak some default config options for flink

2022-08-16 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580612#comment-17580612
 ] 

Danny Chen commented on HUDI-4529:
--

Fixed via master branch: 1ad0e9560e4805b682fe661d78f2ad0f2fa1025b

> Tweak some default config options for flink
> ---
>
> Key: HUDI-4529
> URL: https://issues.apache.org/jira/browse/HUDI-4529
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on pull request #6287: [HUDI-4529] Tweak some default config options for flink

2022-08-16 Thread GitBox


danny0405 commented on PR #6287:
URL: https://github.com/apache/hudi/pull/6287#issuecomment-1217510928

   The failed test case in `hudi-utilities` should not be caused by this patch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-4529] Tweak some default config options for flink (#6287)

2022-08-16 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1ad0e9560e [HUDI-4529] Tweak some default config options for flink 
(#6287)
1ad0e9560e is described below

commit 1ad0e9560e4805b682fe661d78f2ad0f2fa1025b
Author: Danny Chan 
AuthorDate: Wed Aug 17 14:20:36 2022 +0800

[HUDI-4529] Tweak some default config options for flink (#6287)
---
 .../apache/hudi/configuration/FlinkOptions.java| 10 +++
 .../apache/hudi/streamer/FlinkStreamerConfig.java  |  6 ++--
 .../org/apache/hudi/table/HoodieTableFactory.java  | 34 ++
 .../apache/hudi/table/TestHoodieTableFactory.java  | 19 +---
 4 files changed, 38 insertions(+), 31 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index 164106a4e8..3638113288 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -22,13 +22,13 @@ import 
org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringP
 import org.apache.hudi.common.config.ConfigClassProperty;
 import org.apache.hudi.common.config.ConfigGroups;
 import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.model.EventTimeAvroPayload;
 import org.apache.hudi.common.model.HoodieCleaningPolicy;
 import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
 import org.apache.hudi.config.HoodieIndexConfig;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor;
+import org.apache.hudi.hive.MultiPartKeysValueExtractor;
 import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
 import org.apache.hudi.keygen.constant.KeyGeneratorType;
@@ -287,7 +287,7 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> PAYLOAD_CLASS_NAME = ConfigOptions
   .key("write.payload.class")
   .stringType()
-  .defaultValue(OverwriteWithLatestAvroPayload.class.getName())
+  .defaultValue(EventTimeAvroPayload.class.getName())
   .withDescription("Payload class used. Override this, if you like to roll 
your own merge logic, when upserting/inserting.\n"
   + "This will render any value set for the option in-effective");
 
@@ -718,7 +718,7 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> HIVE_SYNC_MODE = ConfigOptions
   .key("hive_sync.mode")
   .stringType()
-  .defaultValue("jdbc")
+  .defaultValue("hms")
   .withDescription("Mode to choose for Hive ops. Valid values are hms, 
jdbc and hiveql, default 'jdbc'");
 
   public static final ConfigOption<String> HIVE_SYNC_USERNAME = ConfigOptions
@@ -754,7 +754,7 @@ public class FlinkOptions extends HoodieConfig {
   public static final ConfigOption<String> HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME = ConfigOptions
   .key("hive_sync.partition_extractor_class")
   .stringType()
-  
.defaultValue(SlashEncodedDayPartitionValueExtractor.class.getCanonicalName())
+  .defaultValue(MultiPartKeysValueExtractor.class.getName())
   .withDescription("Tool to extract the partition value from HDFS path, "
   + "default 'SlashEncodedDayPartitionValueExtractor'");
 
diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
index 3ba1c6230f..3447a23851 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
@@ -24,7 +24,7 @@ import 
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
 import org.apache.hudi.common.model.WriteOperationType;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.configuration.FlinkOptions;
-import org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor;
+import org.apache.hudi.hive.MultiPartKeysValueExtractor;
 import org.apache.hudi.keygen.constant.KeyGeneratorType;
 import org.apache.hudi.util.FlinkStateBackendConverter;
 import org.apache.hudi.util.StreamerUtil;
@@ -321,8 +321,8 @@ public class FlinkStreamerConfig extends Configuration {
   public String hiveSyncPartitionFields = "";
 
   @Parameter(names = {"--hive-sync-partition-extractor-class"}, description = 
"Tool to extract the partition value from 

[GitHub] [hudi] danny0405 merged pull request #6287: [HUDI-4529] Tweak some default config options for flink

2022-08-16 Thread GitBox


danny0405 merged PR #6287:
URL: https://github.com/apache/hudi/pull/6287


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bithw1 commented on issue #6412: [SUPPORT]query between 0 and max commit time yields empty result set.

2022-08-16 Thread GitBox


bithw1 commented on issue #6412:
URL: https://github.com/apache/hudi/issues/6412#issuecomment-1217506808

   Thanks @boneanxs for the helpful answer. 
   
   I tried with `where _hoodie_commit_time > 0 and _hoodie_commit_time <= $T2`, and I got all the data.
   
   But I didn't see the difference between 
   `where _hoodie_commit_time > 0 and _hoodie_commit_time <= $T2`
   and 
   `where _hoodie_commit_time > $T1 and _hoodie_commit_time <= $T2`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-08-16 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1217502926

   
   ## CI report:
   
   * 3e4361accdd100bebd942d54151236ed971046e1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10787)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6287: [HUDI-4529] Tweak some default config options for flink

2022-08-16 Thread GitBox


hudi-bot commented on PR #6287:
URL: https://github.com/apache/hudi/pull/6287#issuecomment-1217499017

   
   ## CI report:
   
   * a5e5c56ec81966d4eabb6d7f8dfdcae3c141a7f1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10786)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Update DOAP with 0.12.0 Release (#6413)

2022-08-16 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 042241fa2c [MINOR] Update DOAP with 0.12.0 Release (#6413)
042241fa2c is described below

commit 042241fa2ca5c90161dc6e062485eef4e0981962
Author: Sagar Sumit 
AuthorDate: Wed Aug 17 11:26:19 2022 +0530

[MINOR] Update DOAP with 0.12.0 Release (#6413)
---
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 7b784ec549..e153fb3d4c 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -96,6 +96,11 @@
 2022-06-18
 0.11.1
   
+  
+Apache Hudi 0.12.0
+2022-08-16
+0.12.0
+  
 
 
   



[GitHub] [hudi] codope merged pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


codope merged PR #6413:
URL: https://github.com/apache/hudi/pull/6413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


codope commented on PR #6413:
URL: https://github.com/apache/hudi/pull/6413#issuecomment-1217493647

   CI failure is not caused by this patch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope opened a new pull request, #6417: [HUDI-4565] Release note for version 0.12.0

2022-08-16 Thread GitBox


codope opened a new pull request, #6417:
URL: https://github.com/apache/hudi/pull/6417

   ### Change Logs
   
   - Release highlights in `website/releases/release-0.12.0.md`
   - Updated `website/releases/download.md`
   - Updated `docusaurus.config.js`
   
   Screenshot: https://user-images.githubusercontent.com/16440354/185044087-53426eef-3936-4700-9dd1-f67d258f7430.png
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] shubham-bungee commented on issue #6389: [SUPPORT] HELP :: Using TWO FIELDS to precombine :: 'hoodie.datasource.write.precombine.field': "column1,column2"

2022-08-16 Thread GitBox


shubham-bungee commented on issue #6389:
URL: https://github.com/apache/hudi/issues/6389#issuecomment-1217486370

   > Unfortunately, there is no out of the box solution to use two fields as 
preCombine for now.
   
   Thanks a lot for the reply.
   We are a startup planning to move to Hudi, so you might see a few more support tickets coming your way. 
   Your help would be great in building our new architecture. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6413: [MINOR] Update DOAP with 0.12.0 Release

2022-08-16 Thread GitBox


hudi-bot commented on PR #6413:
URL: https://github.com/apache/hudi/pull/6413#issuecomment-1217469882

   
   ## CI report:
   
   * 98d233b95b8653fa681b2c24aa900c7a86adddf3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10785)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3407) Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer scenario

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3407:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make sure Restore operation is Not Concurrent w/ Writes in Multi-Writer 
> scenario
> 
>
> Key: HUDI-3407
> URL: https://issues.apache.org/jira/browse/HUDI-3407
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently there's no guard-rail that would prevent Restore from running 
> concurrently with Writes in a Multi-Writer scenario, which might lead to the 
> table getting into an inconsistent state.
>  
> One of the approaches could be letting Restore acquire the Write lock for 
> the whole duration of its operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3585) Docs for (consistent) hashing index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3585:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Docs for (consistent) hashing index
> ---
>
> Key: HUDI-3585
> URL: https://issues.apache.org/jira/browse/HUDI-3585
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: docs
>Reporter: Yuwei Xiao
>Priority: Major
> Fix For: 0.12.1
>
>
> User documents related to (consistent) hashing index, will contain the 
> following content:
> - configs to enable bucket index and tuning parameters
> - use cases and demos
> - limitations and restrictions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-10) Auto tune bulk insert parallelism #555

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-10?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-10:

Fix Version/s: 0.12.1
   (was: 0.12.0)

> Auto tune bulk insert parallelism #555
> --
>
> Key: HUDI-10
> URL: https://issues.apache.org/jira/browse/HUDI-10
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Minor
> Fix For: 0.12.1
>
>
> https://github.com/uber/hudi/issues/555



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-686:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, performance
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1053) Make ComplexKeyGenerator also support non partitioned Hudi dataset

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1053:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make ComplexKeyGenerator also support non partitioned Hudi dataset
> --
>
> Key: HUDI-1053
> URL: https://issues.apache.org/jira/browse/HUDI-1053
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync, storage-management, writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, when using ComplexKeyGenerator, a `default` partition is assumed. 
> Recently there has been interest in supporting non partitioned Hudi datasets 
> that use ComplexKeyGenerator. This GitHub issue has context - 
> https://github.com/apache/hudi/issues/1747



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3965) Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for SparkSQLCLIDriver

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3965:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark sql dml w/ spark2 and scala12 fails w/ ClassNotFoundException for 
> SparkSQLCLIDriver
> -
>
> Key: HUDI-3965
> URL: https://issues.apache.org/jira/browse/HUDI-3965
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> spark-sql dml when launched w/ spark2 and scala 12 profile, fails with 
> ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver for both 0.10.0 and 
> 0.11.0. 
>  
> {code:java}
> java.lang.ClassNotFoundException: 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:816)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> // launch command
> ./bin/spark-sql --jars /home/hadoop/hudi-spark2.4-bundle_2.12-0.11.0-rc3.jar 
> --packages org.apache.spark:spark-avro_2.12:2.4.8 --conf 
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-13) Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive cluster #553

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-13:

Fix Version/s: 0.12.1
   (was: 0.12.0)

> Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive 
> cluster #553
> 
>
> Key: HUDI-13
> URL: https://issues.apache.org/jira/browse/HUDI-13
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, hive, Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.12.1
>
>
> https://github.com/uber/hudi/issues/553



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-736) Simplify ReflectionUtils#getTopLevelClasses

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-736:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Simplify ReflectionUtils#getTopLevelClasses 
> 
>
> Key: HUDI-736
> URL: https://issues.apache.org/jira/browse/HUDI-736
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: new-to-hudi
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1061:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Hudi CLI savepoint command fail because of spark conf loading issue
> ---
>
> Key: HUDI-1061
> URL: https://issues.apache.org/jira/browse/HUDI-1061
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Wenning Ding
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> h3. Reproduce
> open hudi-cli.sh and run these two commands:
> {code:java}
> connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01
> savepoint create --commit 2019115109
> {code}
> You would see this error:
> {code:java}
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866)
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630)
>  at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97)
>  at org.apache.spark.SparkContext.(SparkContext.scala:523) at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) 
> at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) 
> at 
> org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>  at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>  at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>  at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) 
> at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at 
> java.lang.Thread.run(Thread.java:748){code}
> Although {{spark-defaults.conf}} configures {{spark.eventLog.dir 
> hdfs:///var/log/spark/apps}}, hudi cli still uses 
> {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext 
> won't load the configs from {{spark-defaults.conf}}.
> We should make the initJavaSparkConf method able to read configs from the spark 
> config file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1101) Decouple Hive dependencies from hudi-spark

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1101:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Decouple Hive dependencies from hudi-spark
> --
>
> Key: HUDI-1101
> URL: https://issues.apache.org/jira/browse/HUDI-1101
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Yanjia Gary Li
>Priority: Major
> Fix For: 0.12.1
>
>
> We have the syncHive tool in both the hudi-spark and hudi-utilities modules. This 
> might cause dependency conflicts when the user doesn't use Hive at all. We could 
> move all the hive sync related methods to the hudi-hive-sync module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1117) Add tdunning json library to spark and utilities bundle

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1117:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add tdunning json library to spark and utilities bundle
> ---
>
> Key: HUDI-1117
> URL: https://issues.apache.org/jira/browse/HUDI-1117
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, meta-sync
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Exception during Hive Sync:
> ```
> An error occurred while calling o175.save.\n: java.lang.NoClassDefFoundError: 
> org/json/JSONException\n\tat 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:10847)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10047)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10128)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:209)\n\tat
>  
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)\n\tat
>  org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)\n\tat 
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)\n\tat 
> org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)\n\tat 
> org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)\n\tat 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:515)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:498)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:488)\n\tat
>  
> org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:273)\n\tat
>  org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)\n\tat
> ```
> This is from using hudi-spark-bundle. 
> [https://github.com/apache/hudi/issues/1787]
> JSONException class is coming from 
> https://mvnrepository.com/artifact/org.json/json There is a licensing issue and 
> hence it is not part of hudi bundle packages. The underlying issue is due to Hive 
> 1.x vs 2.x ( See 
> https://issues.apache.org/jira/browse/HUDI-150?jql=text%20~%20%22org.json%22%20and%20project%20%3D%20%22Apache%20Hudi%22%20)
> Spark Hive integration still brings in hive 1.x jars which depends on 
> org.json. I believe this was provided in user's environment and hence we have 
> not seen folks complaining about this issue.
> Even though this is not a Hudi issue per se, let me check a jar with a compatible 
> license: https://mvnrepository.com/artifact/com.tdunning/json/1.8 and if it 
> works, we will add it to the 0.6 bundles after discussing with the community. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1145) Debug if Insert operation calls upsert in case of small file handling path.

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1145:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Debug if Insert operation calls upsert in case of small file handling path.
> ---
>
> Key: HUDI-1145
> URL: https://issues.apache.org/jira/browse/HUDI-1145
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.12.1
>
>
> INSERT operations may be triggering UPSERT internally in the Merging process 
> when dealing with small files. This surfaced out of a SLACK thread. Need to 
> confirm if this is indeed happening. If yes, this needs to be fixed such 
> that the MERGE HANDLE should not call upsert and instead let conflicting 
> records into the file if it is an INSERT operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1245) Make debugging Integ tests easier

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1245:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make debugging Integ tests easier
> -
>
> Key: HUDI-1245
> URL: https://issues.apache.org/jira/browse/HUDI-1245
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dev-experience, Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> Debugging integ-tests is hard and not easy for most folks to 
> investigate. This effort tracks the work for the same.
>  
> Also, publish a guide for debugging such integ-tests.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3961) Encounter NoClassDefFoundError when using Spark 3.1 bundle and utilities slim bundle

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3961:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Encounter NoClassDefFoundError when using Spark 3.1 bundle and utilities slim 
> bundle
> 
>
> Key: HUDI-3961
> URL: https://issues.apache.org/jira/browse/HUDI-3961
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When running deltastreamer with both Spark 3.1 and utilities slim bundle 
> (compiled with Spark 3.2 profile), the following exception is thrown:
> {code:java}
> export SPARK_HOME=/Users/ethan/Work/lib/spark-3.1.3-bin-hadoop3.2
> export 
> HUDI_SPARK_BUNDLE_JAR=/Users/ethan/Work/lib/hudi_releases/0.11.0-rc3/hudi-spark3.1-bundle_2.12-0.11.0-rc3.jar
> export 
> HUDI_UTILITIES_SLIM_JAR=/Users/ethan/Work/lib/hudi_releases/0.11.0-rc3/hudi-utilities-slim-bundle_2.12-0.11.0-rc3.jar
> /Users/ethan/Work/lib/spark-3.1.3-bin-hadoop3.2/bin/spark-submit \
>       --master local[4] \
>       --driver-memory 4g --executor-memory 2g --num-executors 4 
> --executor-cores 1 \
>       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>       --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>       --conf spark.sql.catalogImplementation=hive \
>       --conf spark.driver.maxResultSize=1g \
>       --conf spark.speculation=true \
>       --conf spark.speculation.multiplier=1.0 \
>       --conf spark.speculation.quantile=0.5 \
>       --conf spark.ui.port=6680 \
>       --conf spark.eventLog.enabled=true \
>       --conf spark.eventLog.dir=/Users/ethan/Work/data/hudi/spark-logs \
>       --packages org.apache.spark:spark-avro_2.12:3.1.3 \
>       --jars 
> /Users/ethan/Work/repo/hudi-benchmarks/target/hudi-benchmarks-0.1-SNAPSHOT.jar,$HUDI_SPARK_BUNDLE_JAR
>  \
>       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>       $HUDI_UTILITIES_SLIM_JAR \
>       --props $TEST_ROOT_DIR/ds_mor.properties \
>       --source-class BenchmarkDataSource \
>       --source-ordering-field ts \
>       --target-base-path $TEST_ROOT_DIR/test_table \
>       --target-table test_table \
>       --table-type MERGE_ON_READ \
>       --op UPSERT \
>       --continuous{code}
>  
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieException: 
> java.lang.NoClassDefFoundError: org/apache/avro/AvroMissingFieldException
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:191)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:186)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:549)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>     at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>     at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>     at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.util.concurrent.ExecutionException: 
> java.lang.NoClassDefFoundError: org/apache/avro/AvroMissingFieldException
>     at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at 
> org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:189)
>     ... 15 more
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/avro/AvroMissingFieldException
>     at 
> org.apache.hudi.avro.model.HoodieCleanerPlan.newBuilder(HoodieCleanerPlan.java:246)
>     at 
> org.apache.hudi.table.action.clean.CleanPlanActionExecutor.reques

[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-992:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: bootstrap, meta-sync
>Affects Versions: 0.9.0
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently the bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not have the partition column schema (in case of hive partitioned 
> data). As a result, during hive-sync, when hudi tries to determine the type of 
> the partition column from that schema, it would not find it and assume the 
> default data type *string*.
> Here is where partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus no matter what the data type of the partition column is in the source data 
> (at least what spark infers it as from the path), it will always be synced as 
> string.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-488) Refactor Source classes in hudi-utilities

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-488:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Refactor Source classes in hudi-utilities 
> --
>
> Key: HUDI-488
> URL: https://issues.apache.org/jira/browse/HUDI-488
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.12.1
>
>
> There is copy-and-paste code in some of the Source classes due to the 
> current class inheritance structure. Refactoring this part should make it 
> easier and more efficient to create new sources and formats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1036:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: hive
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735]. When hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to 
> create a HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub 
> issue for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1590) Support async clustering w/ test suite job

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1590:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support async clustering w/ test suite job
> --
>
> Key: HUDI-1590
> URL: https://issues.apache.org/jira/browse/HUDI-1590
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 0.12.1
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> As of now, we only have inline clustering support w/ the hoodie test suite job. 
> We need to add support for async clustering. 
> This might be tricky since the regular writes should not overstep w/ 
> clustering; if they do, the pipeline will fail. So, data generation has to go hand 
> in hand w/ the clustering configs. For example, if clustering gets triggered every 
> 4 commits, data generation should switch partitions every 4 batches of 
> input. That way there won't be any overstepping and the pipeline can run for as 
> many iterations as needed. 
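> A minimal sketch of the kind of coordination described above (illustration only; the 
> names are assumptions, not existing test suite APIs):
> {code:java}
> public class PartitionSwitchingSketch {
>   // Route each input batch to a partition derived from the clustering frequency,
>   // so batches written while a clustering plan is pending never touch its file groups.
>   static String partitionForBatch(int batchIndex, int clusteringEveryNCommits) {
>     int bucket = batchIndex / clusteringEveryNCommits; // batches 0-3 -> bucket 0, 4-7 -> bucket 1, ...
>     return "partition_" + bucket;
>   }
> }
> {code}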



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3864) Avoid fetching all files for all partitions on the read/query path for flink

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3864:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Avoid fetching all files for all partitions on the read/query path for flink
> 
>
> Key: HUDI-3864
> URL: https://issues.apache.org/jira/browse/HUDI-3864
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink
>Reporter: sivabalan narayanan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>
> Fetching all files across all partitions should be avoided in the hot path, 
> especially on the query side. We should only fetch files for the 
> partitions of interest. 
> I inspected HoodieFileIndex for spark and things look to be ok. We only load 
> files for the partitions involved in the query. 
>  
> {code:java}
> public BaseHoodieTableFileIndex(HoodieEngineContext engineContext,
> HoodieTableMetaClient metaClient,
> TypedProperties configProperties,
> HoodieTableQueryType queryType,
> List queryPaths, 
> {code}
> queryPaths in the above argument contains only the partitions involved in the 
> split. 
> Later, when we load the files, we load only the matched partitions. 
> {code:java}
> private Map loadPartitionPathFiles() {
>   // List files in all partition paths
>   List pathToFetch = new ArrayList<>();
>   Map cachedPartitionToFiles = new HashMap<>();
>   // Fetch from the FileStatusCache
>   List partitionPaths = getAllQueryPartitionPaths();
>   partitionPaths.forEach(partitionPath -> {
> Option filesInPartition = 
> fileStatusCache.get(partitionPath.fullPartitionPath(basePath));
> if (filesInPartition.isPresent()) {
>   cachedPartitionToFiles.put(partitionPath, filesInPartition.get());
> } else {
>   pathToFetch.add(partitionPath);
> }
>   });
>   Map fetchedPartitionToFiles;
>   if (pathToFetch.isEmpty()) {
> fetchedPartitionToFiles = Collections.emptyMap();
>   } else {
> Map fullPartitionPathsMapToFetch = 
> pathToFetch.stream()
> .collect(Collectors.toMap(
> partitionPath -> 
> partitionPath.fullPartitionPath(basePath).toString(),
> Function.identity())
> );
> fetchedPartitionToFiles =
> FSUtils.getFilesInPartitions(
> engineContext,
> metadataConfig,
> basePath,
> fullPartitionPathsMapToFetch.keySet().toArray(new String[0]),
> fileSystemStorageConfig.getSpillableDir())
> .entrySet()
> .stream()
> .collect(Collectors.toMap(e -> 
> fullPartitionPathsMapToFetch.get(e.getKey()), e -> e.getValue()));
>   }
>   // Update the fileStatusCache
>   fetchedPartitionToFiles.forEach((partitionPath, filesInPartition) -> {
> fileStatusCache.put(partitionPath.fullPartitionPath(basePath), 
> filesInPartition);
>   });
>   return CollectionUtils.combine(cachedPartitionToFiles, 
> fetchedPartitionToFiles);
> } {code}
>  
> I also inspected flink and maybe we are loading all files across all 
> partitions. 
>  
> IncrementalInputSplits 
> [L180|https://github.com/apache/hudi/blob/d16740976e3aa89f2d934b0f1c48208dfe40bc5f/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java#L180]
> fileStatuses = fileIndex.getFilesInPartitions();
>  
> HoodieTableSource 
> [L298|https://github.com/apache/hudi/blob/d16740976e3aa89f2d934b0f1c48208dfe40bc5f/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java#L298]
> FileStatus[] fileStatuses = fileIndex.getFilesInPartitions();
>  
> I do see we pass in the required partition paths in both places, but I will leave 
> it to the flink experts to inspect the code once and close out the ticket if no 
> action is required. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1885) Support Delete/Update Non-Pk Table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1885:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support Delete/Update Non-Pk Table
> --
>
> Key: HUDI-1885
> URL: https://issues.apache.org/jira/browse/HUDI-1885
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, spark-sql
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.12.1
>
>
> Allow deleting/updating a non-pk table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-747) Implement Rollback like API in HoodieWriteClient which can revert all actions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-747:
-
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement Rollback like API in HoodieWriteClient which can revert all actions 
> --
>
> Key: HUDI-747
> URL: https://issues.apache.org/jira/browse/HUDI-747
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> Related to HUDI-716 and PR-1432
> The PR addresses the specific issue of deleting orphaned inflight/requested 
> clean actions created by older versions of Hudi. 
> Currently, rollback reverts only commit and delta-commit 
> operations. We can introduce a new API which will consistently revert all 
> pending actions: clean, compact, commit and delta-commit. Currently, we don't 
> rollback clean; instead we expect future clean operations to first finish up 
> pending cleans. By having this new API (rollbackPendingActions), we can 
> let users consistently revert any actions if they want.
>  
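> A rough sketch of the proposed API shape (illustration only; the method name comes 
> from this ticket, the signature and placement are assumptions):
> {code:java}
> // Hypothetical addition to the write client:
> /**
>  * Revert all pending actions (clean, compact, commit, delta-commit)
>  * whose instant time is at or before the given instant.
>  */
> public void rollbackPendingActions(String maxInstantTime);
> {code}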



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2157) Spark write the bucket index table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2157:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark write the bucket index table
> --
>
> Key: HUDI-2157
> URL: https://issues.apache.org/jira/browse/HUDI-2157
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1179) Add Row tests to all key generator test classes

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1179:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add Row tests to all key generator test classes
> ---
>
> Key: HUDI-1179
> URL: https://issues.apache.org/jira/browse/HUDI-1179
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1207) Add kafka implementation of write commit callback to Spark datasources

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1207:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add kafka implementation of write commit callback to Spark datasources
> --
>
> Key: HUDI-1207
> URL: https://issues.apache.org/jira/browse/HUDI-1207
> Project: Apache Hudi
>  Issue Type: Task
>Affects Versions: 0.9.0
>Reporter: wangxianghu#1
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2003:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinay
>Priority: Minor
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Context : 
> Submitted  a spark job to read 3-4B ORC records and wrote to Hudi format. 
> Creating the following table with all the runs that I had carried out based 
> on different options
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it appears that the compression ratio estimate is off. 
>  
>  
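> Back-of-envelope check (added for illustration, using only the numbers above): with a 
> 1 GB target the files come out at roughly 178/1024 ≈ 0.17 of the limit, with 4 GB at 
> 675/4096 ≈ 0.16, and with 6 GB at 1012/6144 ≈ 0.16. The effective input-to-output ratio 
> is consistently around 6x, which suggests the assumed compression ratio used for sizing 
> is off by a roughly constant factor rather than varying with the data.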



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1645) Add unit test to verify clean and rollback instants are archived correctly

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1645:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add unit test to verify clean and rollback instants are archived correctly
> --
>
> Key: HUDI-1645
> URL: https://issues.apache.org/jira/browse/HUDI-1645
> Project: Apache Hudi
>  Issue Type: Test
>  Components: table-service
>Affects Versions: 0.9.0
>Reporter: satish
>Assignee: satish
>Priority: Major
> Fix For: 0.12.1
>
>
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/io/TestHoodieTimelineArchiveLog.java
> The tests don't seem to cover clean/rollback instants. Add those instants and 
> make sure they are archived correctly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1748) Read operation will possibility fail on mor table rt view when a write operations is concurrency running

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1748:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Read operation will possibility fail on mor table rt view when a write 
> operations is concurrency running
> 
>
> Key: HUDI-1748
> URL: https://issues.apache.org/jira/browse/HUDI-1748
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: lrz
>Priority: Major
>  Labels: core-flow-ds, pull-request-available, query-eng, 
> user-support-issues
> Fix For: 0.12.1
>
>
> During a read operation, a new base file may be produced by a concurrent write 
> operation. The read can then possibly hit an NPE in getSplit. Here 
> is the exception stack:
> !https://wa.vision.huawei.com/vision-file-storage/api/file/download/upload-v2/2021/2/15/qwx352829/7bacca8042104499b0991d50b4bc3f2a/image.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1574) Trim existing unit tests to finish in much shorter amount of time

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1574:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Trim existing unit tests to finish in much shorter amount of time
> -
>
> Key: HUDI-1574
> URL: https://issues.apache.org/jira/browse/HUDI-1574
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Minor
> Fix For: 0.12.1
>
>
> spark-client-tests
> 278.165 s - in org.apache.hudi.table.TestHoodieMergeOnReadTable
> 201.628 s - in org.apache.hudi.metadata.TestHoodieBackedMetadata
> 185.716 s - in org.apache.hudi.client.TestHoodieClientOnCopyOnWriteStorage
> 158.361 s - in org.apache.hudi.index.TestHoodieIndex
> 156.196 s - in org.apache.hudi.table.TestCleaner
> 132.369 s - in 
> org.apache.hudi.table.action.commit.TestCopyOnWriteActionExecutor
> 93.307 s - in org.apache.hudi.table.action.compact.TestAsyncCompaction
> 67.301 s - in org.apache.hudi.table.upgrade.TestUpgradeDowngrade
> 45.794 s - in org.apache.hudi.client.TestHoodieReadClient
> 38.615 s - in org.apache.hudi.index.bloom.TestHoodieBloomIndex
> 31.181 s - in org.apache.hudi.client.TestTableSchemaEvolution
> 20.072 s - in org.apache.hudi.table.action.compact.TestInlineCompaction
> grep " Time elapsed" hudi-client/hudi-spark-client/target/surefire-reports/* 
> | awk -F',' ' { print $5 } ' | awk -F':' ' { print $2 } ' | sort -nr | less
> hudi-utilities
> 209.936 s - in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer
> 204.653 s - in 
> org.apache.hudi.utilities.functional.TestHoodieMultiTableDeltaStreamer
> 34.116 s - in org.apache.hudi.utilities.sources.TestKafkaSource
> 29.865 s - in org.apache.hudi.utilities.sources.TestParquetDFSSource
> 26.189 s - in 
> org.apache.hudi.utilities.sources.helpers.TestDatePartitionPathSelector
> Other Tests
> 42.595 s - in org.apache.hudi.common.functional.TestHoodieLogFormat
> 38.918 s - in org.apache.hudi.common.bootstrap.TestBootstrapIndex
> 22.046 s - in 
> org.apache.hudi.common.functional.TestHoodieLogFormatAppendFailure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1616) Abstract out one off operations within dag

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1616:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Abstract out one off operations within dag
> --
>
> Key: HUDI-1616
> URL: https://issues.apache.org/jira/browse/HUDI-1616
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing, tests-ci
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.1
>
>
> In the existing test suite, we have a config called "execute_itr_count". When 
> this is set to N for a particular node, out of 50-odd iterations, this node 
> will be executed only on the Nth iteration. 
> Use-case: 
> we wish to execute the clustering node on the 10th iteration, but the entire dag 
> needs to be executed for 25 iterations. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1329) Support async compaction in spark DF write()

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1329:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support async compaction in spark DF write()
> 
>
> Key: HUDI-1329
> URL: https://issues.apache.org/jira/browse/HUDI-1329
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: compaction, spark, table-service
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> spark.write().format("hudi").option(operation, "run_compact") to run 
> compaction
>  
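> Sketched usage of what this could look like (illustration only; "run_compact" is the 
> value proposed in this ticket, not an existing option, and the table name is hypothetical):
> {code:java}
> // df: an org.apache.spark.sql.Dataset<Row> to be written
> df.write()
>   .format("hudi")
>   .option("hoodie.datasource.write.operation", "run_compact") // proposed, not yet supported
>   .option("hoodie.table.name", "my_table")                     // hypothetical table name
>   .mode("append")
>   .save(basePath);
> {code}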



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1275) Incremental TImeline Syncing causes compaction to fail with FileNotFound exception

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1275:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Incremental TImeline Syncing causes compaction to fail with FileNotFound 
> exception
> --
>
> Key: HUDI-1275
> URL: https://issues.apache.org/jira/browse/HUDI-1275
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.1
>
>
> Context: [https://github.com/apache/hudi/issues/2020]
>  
>  
> {{20/08/25 07:17:13 WARN TaskSetManager: Lost task 3.0 in stage 41.0 (TID 
> 2540, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): 
> org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No 
> such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
> at 
> org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
> at 
> org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory 
> 's3://myBucket/absolute_path_to/daas_date=2020/56be5da5-f5f3-4675-8dec-433f3656f839-0_3-816-50630_20200825065331.parquet'
> at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
> at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
> at 
> org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
> at 
> org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
> ... 26 more}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1247) Add jmh based benchmarking to hudi

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1247:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add jmh based benchmarking to hudi
> --
>
> Key: HUDI-1247
> URL: https://issues.apache.org/jira/browse/HUDI-1247
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing, tests-ci
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: performance
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1380) Async cleaning does not work with Timeline Server

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1380:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Async cleaning does not work with Timeline Server
> -
>
> Key: HUDI-1380
> URL: https://issues.apache.org/jira/browse/HUDI-1380
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, table-service, timeline-server
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1212) GDPR: Support deletions of records on all versions of Hudi dataset

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1212:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> GDPR: Support deletions of records on  all versions of Hudi dataset
> ---
>
> Key: HUDI-1212
> URL: https://issues.apache.org/jira/browse/HUDI-1212
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.12.1
>
>
> Incremental Pull should also stop returning records in the historical dataset 
> when we delete them from the latest snapshot.
>  
> Context from Mailing list email :
>  
> Hello,
> I am Siva's colleague and I am working on the problem below as well.
> I would like to describe what we are trying to achieve with Hudi as well as 
> our current way of working and our GDPR and "Right To Be Forgotten " 
> compliance policies.
> Our requirements :
> - We wish to apply a strict interpretation of the RTBF.  In other words, when 
> we remove a person's data, it should be throughout the historical data and 
> not just the latest snapshot.
> - We wish to use Hudi to reduce our storage requirements using upserts and 
> don't want to have duplicates between commits.
> - We wish to retain history for persons who have not requested to be 
> forgotten and therefore we do not want to delete commit files from the 
> history as some have proposed.
> We have tried a couple of solutions, but so far without success :
> - replay the data omitting the data of the persons who have requested to be 
> forgotten.  We wanted to manipulate the commit times to rebuild the history.
> We found that we couldn't manipulate the commit times and retain the history.
> - replay the data omitting the data of the persons who have requested to be 
> forgotten, but writing to a date-based partition folder using the 
> "partitionpath" parameter.
> We found that commits using upserts between the partitionpath folders, do not 
> ignore data that is unchanged between 2 commit dates as when using the 
> default commit file system, so we will not save on our storage or speed up 
> our  processing using this technique.
> So basically we would like to find a way to apply a strict RTBF, GDPR, 
> maintain history and time-travel (large history) and save storage space using 
> Hudi.
> Can anyone see a way to achieve this?
> Kind Regards,
> David Rosalia
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2154) Add index key field into HoodieKey

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2154:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add index key field into HoodieKey
> --
>
> Key: HUDI-2154
> URL: https://issues.apache.org/jira/browse/HUDI-2154
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: index
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1823) Hive/Presto Integration with ORC

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1823:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Hive/Presto Integration with ORC
> 
>
> Key: HUDI-1823
> URL: https://issues.apache.org/jira/browse/HUDI-1823
> Project: Apache Hudi
>  Issue Type: Task
>  Components: storage-management
>Reporter: Teresa Kang
>Priority: Major
> Fix For: 0.12.1
>
>
> Implement HoodieOrcInputFormat to support ORC with spark/presto query engines.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2612) No need to define primary key for flink insert operation

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2612:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> No need to define primary key for flink insert operation
> 
>
> Key: HUDI-2612
> URL: https://issues.apache.org/jira/browse/HUDI-2612
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>
> There is one exception: the MOR table may still need the pk to generate 
> {{HoodieKey}} for #preCombine and compaction merge.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2156) Cluster the table with bucket index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2156:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Cluster the table with bucket index
> ---
>
> Key: HUDI-2156
> URL: https://issues.apache.org/jira/browse/HUDI-2156
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1556) Add App Id and App name to HoodieDeltaStreamerMetrics

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1556:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add App Id and App name to HoodieDeltaStreamerMetrics
> -
>
> Key: HUDI-1556
> URL: https://issues.apache.org/jira/browse/HUDI-1556
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metrics
>Affects Versions: 0.9.0
>Reporter: wangxianghu#1
>Priority: Major
> Fix For: 0.12.1
>
>
> We need something unique to relate metric data to the spark job.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1271) Add utility scripts to perform Restores

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1271:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add utility scripts to perform Restores
> ---
>
> Key: HUDI-1271
> URL: https://issues.apache.org/jira/browse/HUDI-1271
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: cli, Utilities
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>
> We need to expose commands for performing restores.
> We have a similar script for the cleaner: 
> org.apache.hudi.utilities.HoodieCleaner
> We need to add something similar for restores.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1698:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, writer-core
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1779:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, spark
>Reporter: lrz
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when hudi bootstraps a parquet file, or upserts into a parquet file 
> which contains a timestamp column, it fails because of these issues:
> 1) During the bootstrap operation, if the origin parquet file was written by a spark 
> application, spark will by default save the timestamp as int96 (see 
> spark.sql.parquet.int96AsTimestamp), and the bootstrap will fail because 
> Hudi cannot read the Int96 type yet. (This can be solved by upgrading 
> parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true, please check 
> https://github.com/apache/parquet-mr/pull/831/files) 
> 2) After bootstrap, an upsert will fail because we use the hoodie schema to 
> read the origin parquet file. The schemas do not match because the hoodie schema 
> treats the timestamp as long while in the origin file it is Int96. 
> 3) After bootstrap, a partial update of a parquet file will fail, because 
> we copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like spark does)
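> For item 1), a minimal sketch of the workaround mentioned above (assumes parquet-avro 
> 1.12.0+ on the classpath):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> 
> Configuration hadoopConf = new Configuration();
> // Read INT96 timestamps as Avro fixed instead of failing on the unsupported type.
> hadoopConf.setBoolean("parquet.avro.readInt96AsFixed", true);
> {code}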



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1440) Allow option to override schema when doing spark.write

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1440:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Allow option to override schema when doing spark.write
> --
>
> Key: HUDI-1440
> URL: https://issues.apache.org/jira/browse/HUDI-1440
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: user-support-issues
> Fix For: 0.12.1
>
>
> Need ability to pass schema and use it to create RDD when creating input 
> batch from data-frame. 
>  
> df.write.format("hudi").option("hudi.avro.schema", "")..
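> A usage sketch of the proposed option (illustration only; "hudi.avro.schema" is the key 
> suggested above, and the schema string and table name are hypothetical placeholders):
> {code:java}
> // df: an org.apache.spark.sql.Dataset<Row> to be written
> String avroSchemaJson = "{\"type\":\"record\",\"name\":\"rec\",\"fields\":[]}"; // placeholder schema
> df.write()
>   .format("hudi")
>   .option("hudi.avro.schema", avroSchemaJson)
>   .option("hoodie.table.name", "my_table")
>   .mode("append")
>   .save(basePath);
> {code}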



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2188) Improve test for the insert_overwrite and insert_overwrite_table in hoodieDeltaStreamer

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2188:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Improve test for the insert_overwrite and insert_overwrite_table in 
> hoodieDeltaStreamer
> ---
>
> Key: HUDI-2188
> URL: https://issues.apache.org/jira/browse/HUDI-2188
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: Samrat Deb
>Assignee: Samrat Deb
>Priority: Major
> Fix For: 0.12.1
>
>
> InsertOverwrite overwrites only the partitions matching the incoming records. 
> We need to add a test that verifies insert_overwrite does not overwrite 
> mismatched partitions. 
> Reference: https://github.com/apache/hudi/pull/3184/files#r670993094



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2946) Upgrade maven plugin to make Hudi be compatible with higher Java versions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2946:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Upgrade maven plugin to make Hudi be compatible with higher Java versions
> -
>
> Key: HUDI-2946
> URL: https://issues.apache.org/jira/browse/HUDI-2946
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Wenning Ding
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> I saw several issues while building Hudi w/ Java 11:
>  
> {{[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar (default) on project 
> hudi-common: Execution default of goal 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar failed: An API 
> incompatibility was encountered while executing 
> org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar: 
> java.lang.ExceptionInInitializerError: null[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-shade-plugin:3.1.1:shade (default) on project 
> hudi-hadoop-mr-bundle: Error creating shaded jar: Problem shading JAR 
> /workspace/workspace/rchertar.bigtop.hudi-rpm-mainline-6.x-0.9.0/build/hudi/rpm/BUILD/hudi-0.9.0-amzn-1-SNAPSHOT/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.9.0-amzn-1-SNAPSHOT.jar
>  entry org/apache/hudi/hadoop/bundle/Main.class: 
> java.lang.IllegalArgumentException -> [Help 1]}}
>  
> We need to upgrade maven plugin versions to make Hudi compatible with Java 
> 11.
> Also upgrade dockerfile-maven-plugin to the latest version to support Java 11: 
> [https://github.com/spotify/dockerfile-maven/pull/230]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2657) Make inlining configurable based on diff use-case.

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2657:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make inlining configurable based on diff use-case. 
> ---
>
> Key: HUDI-2657
> URL: https://issues.apache.org/jira/browse/HUDI-2657
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata
>Reporter: sivabalan narayanan
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.12.1
>
>
> Make inlining configurable based on diff use-case.
> The files partition, column_stats and bloom might need inlining, but the record level 
> index may not need inline reading. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2260) Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2260:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs
> -
>
> Key: HUDI-2260
> URL: https://issues.apache.org/jira/browse/HUDI-2260
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> Fix hardcoding of SimpleKeyGen for default KeyGenProp for virtual key configs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3000) [UMBRELLA] Consistent Hashing Index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3000:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> [UMBRELLA] Consistent Hashing Index
> ---
>
> Key: HUDI-3000
> URL: https://issues.apache.org/jira/browse/HUDI-3000
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: index
>Reporter: Yuwei Xiao
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2884) Allow loading external configs while querying Hudi tables with Spark

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2884:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Allow loading external configs while querying Hudi tables with Spark
> 
>
> Key: HUDI-2884
> URL: https://issues.apache.org/jira/browse/HUDI-2884
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, when doing Hudi queries w/ Spark, Hudi won't load the external 
> configurations. Say customers enabled metadata listing in their global 
> config file; they would then actually be querying w/o the metadata feature 
> enabled. This CR fixes this issue and allows loading global configs during 
> the Hudi reading phase.
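> A concrete example for illustration (not from the original report): if a global 
> hudi-defaults.conf (typically looked up under /etc/hudi/conf, or the directory pointed 
> to by HUDI_CONF_DIR) contains hoodie.metadata.enable=true, a Spark query would previously 
> ignore that setting and list the file system directly.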



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2638) Rewrite tests around Hudi index

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2638:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rewrite tests around Hudi index
> ---
>
> Key: HUDI-2638
> URL: https://issues.apache.org/jira/browse/HUDI-2638
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> There is duplicate code between `TestFlinkHoodieBloomIndex` and 
> `TestHoodieBloomIndex`, among other test classes.  We should do one pass to 
> clean the test code once the refactoring is done.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2928:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Evaluate rebasing Hudi's default compression from Gzip to Zstd
> --
>
> Key: HUDI-2928
> URL: https://issues.apache.org/jira/browse/HUDI-2928
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance, storage-management
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot 
> 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png
>
>
> Currently, having Gzip as a default we prioritize Compression/Storage cost at 
> the expense of
>  * Compute (on the {+}write-path{+}): about *30%* of Compute burned during 
> bulk-insert in local benchmarks on Amazon Reviews dataset is Gzip (see below) 
>  * Compute (on the {+}read-path{+}), as well as queries Latencies: queries 
> scanning large datasets are likely to be compression-/CPU-bound (Gzip t/put 
> is *3-4x* less than Snappy, Zstd, 
> [EX|https://stackoverflow.com/a/56410326/3520840])
> P.S Spark switched its default compression algorithm to Snappy [a while 
> ago|https://github.com/apache/spark/pull/12256].
>  
> *EDIT*
> We should actually evaluate putting in 
> [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  instead of Snappy. It has compression ratios comparable to Gzip, while 
> bringing in much better performance:
> !image-2021-12-03-13-13-02-892.png!
> [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  
>  
>  
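> For comparison while the default is evaluated, a writer can already opt into a different 
> codec per table (illustration only; zstd support ultimately depends on the Parquet/Hadoop 
> libraries on the classpath):
> {code:java}
> // df: an org.apache.spark.sql.Dataset<Row> to be written
> df.write()
>   .format("hudi")
>   .option("hoodie.parquet.compression.codec", "zstd") // default is currently gzip
>   .mode("append")
>   .save(basePath);
> {code}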



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2762:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive
>Reporter: Rajesh Mahindra
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert-only log files. Both _ro and _rt tables are returning 0 
> rows, even though hms does create the table and its partitions. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/®ion=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2932) Debug CI failure for TestHoodieBackedMetadata#testCleaningArchivingAndCompaction

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2932:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Debug CI failure for 
> TestHoodieBackedMetadata#testCleaningArchivingAndCompaction
> 
>
> Key: HUDI-2932
> URL: https://issues.apache.org/jira/browse/HUDI-2932
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Major
> Fix For: 0.12.1
>
>
> TestHoodieBackedMetadata#testCleaningArchivingAndCompaction is flaky in CI, 
> whereas it passes consistently locally. Need to debug and find the root 
> cause.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2954) Code cleanup: HFileDataBock - using integer keys is never used

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2954:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Code cleanup: HFileDataBock - using integer keys is never used 
> ---
>
> Key: HUDI-2954
> URL: https://issues.apache.org/jira/browse/HUDI-2954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Minor
> Fix For: 0.12.1
>
>
>  
> The key field can never be empty for HFile. If so, there is really no need to 
> fall back to sequential integer keys in the 
> HFileDataBlock::serializeRecords() code path.
>  
> {noformat}
> // Build the record key
> final Field schemaKeyField = 
> records.get(0).getSchema().getField(this.keyField);
> if (schemaKeyField == null) {
>   // Missing key metadata field. Use an integer sequence key instead.
>   useIntegerKey = true;
>   keySize = (int) Math.ceil(Math.log(records.size())) + 1;
> }
> while (itr.hasNext()) {
>   IndexedRecord record = itr.next();
>   String recordKey;
>   if (useIntegerKey) {
> recordKey = String.format("%" + keySize + "s", key++);
>   } else {
> recordKey = record.get(schemaKeyField.pos()).toString();
>   }
> {noformat}
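> A sketch of the proposed simplification (illustration only; the surrounding 
> HFileDataBlock code is elided and the validation helper is an assumption):
> {code:java}
> final Schema.Field schemaKeyField = records.get(0).getSchema().getField(this.keyField);
> // With the key field guaranteed to be present, the integer-sequence fallback disappears.
> ValidationUtils.checkState(schemaKeyField != null, "Missing key field: " + this.keyField);
> while (itr.hasNext()) {
>   IndexedRecord record = itr.next();
>   String recordKey = record.get(schemaKeyField.pos()).toString();
>   // serialize (recordKey, record) into the HFile as before
> }
> {code}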



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2237) Add virtual key support for ORC file format

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2237:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add virtual key support for ORC file format
> ---
>
> Key: HUDI-2237
> URL: https://issues.apache.org/jira/browse/HUDI-2237
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2504) Add configuration to make HoodieBootstrap support ignoring file suffix

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2504:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add configuration to make HoodieBootstrap support ignoring file suffix
> --
>
> Key: HUDI-2504
> URL: https://issues.apache.org/jira/browse/HUDI-2504
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: bootstrap
>Reporter: liujinhui
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2754) Performance improvement for IncrementalRelation

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2754:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Performance improvement for IncrementalRelation
> ---
>
> Key: HUDI-2754
> URL: https://issues.apache.org/jira/browse/HUDI-2754
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query, performance
>Reporter: Jintao
>Assignee: Jintao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When HoodieIncrSource is used to fetch updates from another Hudi table, 
> the IncrementalRelation will be used to read the data. But it has a 
> performance issue because column pruning and predicate pushdown don't 
> happen. As a result, Hudi reads too much useless data.
> By enabling column pruning and predicate pushdown, the data to read is 
> reduced dramatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2991) Add rename partition for spark sql

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2991:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add rename partition for spark sql
> --
>
> Key: HUDI-2991
> URL: https://issues.apache.org/jira/browse/HUDI-2991
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
> Fix For: 0.12.1
>
>
> – RENAME partition
> ||{{partition}}||
> |{{age=10}}|
> |{{age=11}}|
> |{{age=12}}|
>  {{ALTER TABLE default.StudentInfo PARTITION (age='10') RENAME TO PARTITION 
> (age='15');}}
> ||{{partition}}||
> |{{age=11}}|
> |{{age=12}}|
> |{{age=15}}|
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2988) Add Event time configuration: latency adjustment

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2988:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Add Event time configuration: latency adjustment
> 
>
> Key: HUDI-2988
> URL: https://issues.apache.org/jira/browse/HUDI-2988
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
> Fix For: 0.12.1
>
>
> hoodie.payload.event.time.adjust.seconds
> It defaults to 0.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3049) Use flink table name as default synced hive table name

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3049:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Use flink table name as default synced hive table name
> --
>
> Key: HUDI-3049
> URL: https://issues.apache.org/jira/browse/HUDI-3049
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2860) Make timeline server work with concurrent/async table service

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2860:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make timeline server work with concurrent/async table service
> -
>
> Key: HUDI-2860
> URL: https://issues.apache.org/jira/browse/HUDI-2860
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> Make timeline server work with multiple concurrent writers. 
> As of now, if an executor is lagging wrt the timeline server (the timeline server 
> refreshes its state for every call if the timeline has moved), we throw an 
> exception and the executor falls back to the secondary, which will list the file 
> system. 
>  
> Related ticket: https://issues.apache.org/jira/browse/HUDI-2761
>  
> We want to revisit this code and see how we can make the timeline server work 
> in a multi-writer scenario. 
>  
> A few points to consider:
> 1. Executors should try to call getLatestBaseFilesOnOrBefore() instead of 
> getLatestBaseFiles(). Not all calls have to be fixed; the ones doing conflict 
> resolution might have to always get the latest snapshot. 
> 2. Fix async services to use a separate write client in the deltastreamer flow.
> 3. Revisit every call from the executor and set the "REFRESH" param only when it 
> matters.
> 4. Sharing the embedded timeline server. 
> 5. Check for any holes: when C100 and C101 are started concurrently and C101 
> finishes early, if C100 calls getLatestBaseFileOnOrBefore(), do we return 
> base files from C101? 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2808) Supports deduplication for streaming write

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2808:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Supports deduplication for streaming write
> --
>
> Key: HUDI-2808
> URL: https://issues.apache.org/jira/browse/HUDI-2808
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: WangMinChao
>Assignee: WangMinChao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Currently, a flink changelog stream writes to a MOR table, which can be 
> deduplicated during batch reading, but it will not be deduplicated during 
> stream reading. However, many users hope that stream reading can also achieve 
> the upsert capability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2940) Support sync database and table created by Flink catalog to hive

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2940:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Support sync database and table created by Flink catalog to hive
> 
>
> Key: HUDI-2940
> URL: https://issues.apache.org/jira/browse/HUDI-2940
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink
>Reporter: dalongliu
>Priority: Major
> Fix For: 0.12.1
>
>
> As the title above says, we should support syncing databases and tables created by 
> the Flink catalog to hive; this will help users analyze tables conveniently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2786) Failed to connect to namenode in Docker Demo on Apple M1 chip

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2786:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Failed to connect to namenode in Docker Demo on Apple M1 chip
> -
>
> Key: HUDI-2786
> URL: https://issues.apache.org/jira/browse/HUDI-2786
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, dev-experience
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 0.12.1
>
>
> {code:java}
> > ./setup_demo.sh 
> [+] Running 1/0
>  ⠿ compose  Warning: No resource found to remove                    0.0s
> [+] Running 15/15
>  ⠿ namenode Pulled                                                  1.4s
>  ⠿ kafka Pulled                                                     1.3s
>  ⠿ presto-worker-1 Pulled                                           1.3s
>  ⠿ historyserver Pulled                                             1.4s
>  ⠿ adhoc-2 Pulled                                                   1.3s
>  ⠿ adhoc-1 Pulled                                                   1.4s
>  ⠿ graphite Pulled                                                  1.3s
>  ⠿ sparkmaster Pulled                                               1.3s
>  ⠿ hive-metastore-postgresql Pulled                                 1.3s
>  ⠿ presto-coordinator-1 Pulled                                      1.3s
>  ⠿ spark-worker-1 Pulled                                            1.4s
>  ⠿ hiveserver Pulled                                                1.3s
>  ⠿ hivemetastore Pulled                                             1.4s
>  ⠿ zookeeper Pulled                                                 1.3s
>  ⠿ datanode1 Pulled                                                 1.3s
> [+] Running 16/16
>  ⠿ Network compose_default              Created                     0.0s
>  ⠿ Container hive-metastore-postgresql  Started                     1.1s
>  ⠿ Container kafkabroker                Started                     1.1s
>  ⠿ Container zookeeper                  Started

[jira] [Updated] (HUDI-3017) Infer FlinkStreamer options like table source

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3017:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Infer FlinkStreamer options like table source
> -
>
> Key: HUDI-3017
> URL: https://issues.apache.org/jira/browse/HUDI-3017
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Assignee: singh.zhang
>Priority: Major
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3304) support partial update on mor table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3304:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> support partial update on mor table 
> 
>
> Key: HUDI-3304
> URL: https://issues.apache.org/jira/browse/HUDI-3304
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: writer-core
>Reporter: Jian Feng
>Assignee: Jian Feng
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: image2022-1-13_0-33-5.png
>
>
> h2. Current status
>  * OverwriteNonDefaultsWithLatestAvroPayload implements partial update 
> behavior in its combineAndGetUpdateValue method.
>  * Spark SQL also has a 'MERGE INTO' syntax that supports partial update via 
> ExpressionPayload.
>  * Neither OverwriteNonDefaultsWithLatestAvroPayload nor ExpressionPayload can 
> handle partial update in the preCombine method, so they can only support 
> partial update on COW tables.
> h2. Solution
> Make the preCombine function also support partial update (the schema needs to be 
> passed as a parameter).
> !image2022-1-13_0-33-5.png|width=832,height=516!
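To make the field-level merge concrete, here is a minimal, self-contained Avro sketch of the semantics a schema-aware preCombine would need: non-null fields of the newer record win, and null fields fall back to the older record. This is an illustration of the merge logic only, not the actual Hudi payload API.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class PartialMergeSketch {

  /**
   * Merges two versions of the same record: fields that are null in the newer
   * record fall back to the older record's values, mimicking partial update
   * semantics at preCombine time (which is why the schema must be available).
   */
  public static GenericRecord partialMerge(GenericRecord older, GenericRecord newer, Schema schema) {
    GenericRecord merged = new GenericData.Record(schema);
    for (Schema.Field field : schema.getFields()) {
      Object newVal = newer.get(field.name());
      merged.put(field.name(), newVal != null ? newVal : older.get(field.name()));
    }
    return merged;
  }
}
```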



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2867) Make HoodiePartitionPath optional

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2867:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Make HoodiePartitionPath optional
> -
>
> Key: HUDI-2867
> URL: https://issues.apache.org/jira/browse/HUDI-2867
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.1
>
>
> We should make the partition path optional and support it end to end for all 
> operations.
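For reference, the closest existing behavior is writing a table with no partition path via the non-partitioned key generator. The sketch below shows that configuration from Java Spark; the table name, field names, and path are placeholders, while the option keys and the NonpartitionedKeyGenerator class are existing Hudi configuration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class NonPartitionedWriteSketch {

  /** Writes df as a non-partitioned Hudi table; names and path are placeholders. */
  public static void write(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "non_partitioned_tbl")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // No partition path: use the non-partitioned key generator and leave the partition field empty.
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
        .option("hoodie.datasource.write.partitionpath.field", "")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```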



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3335:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>  Labels: hudi-on-call, user-support-issues
> Fix For: 0.12.1
>
>
> I have a COW table with metadata enabled. Loading it from a Spark query fails with 
> java.lang.NullPointerException.
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(:46)
>   at $anonfun$res3$1$adapted(:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-

[jira] [Updated] (HUDI-3121) Spark datasource with bucket index unit test reuse

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3121:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Spark datasource with bucket index unit test reuse
> --
>
> Key: HUDI-3121
> URL: https://issues.apache.org/jira/browse/HUDI-3121
> Project: Apache Hudi
>  Issue Type: Test
>  Components: index, tests-ci
>Reporter: XiaoyuGeng
>Priority: Major
> Fix For: 0.12.1
>
>
> Let `TestMORDataSourceWithBucket` reuse the existing unit tests by parameterizing them.
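One common way to do this is a JUnit 5 parameterized test over the index type, so the same assertions run for bucket and non-bucket indexes. The sketch below is illustrative only: the test name and body are made up, and a real version would feed the parameter into the write configs (e.g. hoodie.index.type).

```java
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

public class ParameterizedIndexTestSketch {

  @ParameterizedTest
  @ValueSource(strings = {"SIMPLE", "BLOOM", "BUCKET"})
  void testUpsertWithIndexType(String indexType) {
    // In the real test this would build write configs with hoodie.index.type = indexType
    // and then run the shared MOR datasource assertions.
    Assertions.assertNotNull(indexType);
  }
}
```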



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3381) Rebase `HoodieMergeHandle` to operate on `HoodieRecord`

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3381:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rebase `HoodieMergeHandle` to operate on `HoodieRecord`
> ---
>
> Key: HUDI-3381
> URL: https://issues.apache.org/jira/browse/HUDI-3381
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> From RFC-46:
> `HoodieWriteHandle`s will be:
>    1. Accepting `HoodieRecord` instead of the raw Avro payload (avoiding Avro 
> conversion)
>    2. Using the Combining API engine to merge records (when necessary)
>    3. Passing `HoodieRecord` as is to the `FileWriter`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2768) Enable async timeline server by default

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2768:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Enable async timeline server by default
> ---
>
> Key: HUDI-2768
> URL: https://issues.apache.org/jira/browse/HUDI-2768
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: timeline-server, writer-core
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: hudi-on-call
> Fix For: 0.12.1
>
>
> Enable async timeline server by default.
>  
> [https://github.com/apache/hudi/pull/3949]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2544) Use standard builder pattern to refactor ConfigProperty

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-2544:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Use standard builder pattern to refactor ConfigProperty
> ---
>
> Key: HUDI-2544
> URL: https://issues.apache.org/jira/browse/HUDI-2544
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, configs
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Minor
> Fix For: 0.12.1
>
>
> I notice that a ConfigProperty object is currently defined via a non-standard 
> builder pattern: only the `defaultValue` and `noDefaultValue` methods are 
> executed in `PropertyBuilder`.
>  
> Calling the `withAlternatives`, `sinceVersion`, `deprecatedAfter`, or 
> `withInferFunction` methods creates another ConfigProperty object each time, which 
> is then left for the JVM to garbage collect.
>  
> So, is it worth a minor refactor?
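For illustration, here is a generic sketch of the standard builder pattern being suggested: each with-style method mutates a single builder instance and returns `this`, and only `build()` allocates the final immutable object. The class and field names loosely mirror the ones mentioned above but are illustrative, not Hudi's actual PropertyBuilder.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: one mutable builder instance, finalized by build().
public final class ConfigPropertySketch<T> {
  private final String key;
  private final T defaultValue;
  private final List<String> alternatives;
  private final String sinceVersion;

  private ConfigPropertySketch(String key, T defaultValue, List<String> alternatives, String sinceVersion) {
    this.key = key;
    this.defaultValue = defaultValue;
    this.alternatives = alternatives;
    this.sinceVersion = sinceVersion;
  }

  public String getKey() { return key; }
  public T getDefaultValue() { return defaultValue; }
  public List<String> getAlternatives() { return alternatives; }
  public String getSinceVersion() { return sinceVersion; }

  public static <T> Builder<T> key(String key) {
    return new Builder<>(key);
  }

  public static final class Builder<T> {
    private final String key;
    private T defaultValue;
    private List<String> alternatives = Collections.emptyList();
    private String sinceVersion;

    private Builder(String key) { this.key = key; }

    public Builder<T> defaultValue(T value) { this.defaultValue = value; return this; }
    public Builder<T> withAlternatives(String... alts) { this.alternatives = Arrays.asList(alts); return this; }
    public Builder<T> sinceVersion(String version) { this.sinceVersion = version; return this; }

    // Only build() creates the immutable ConfigPropertySketch, so intermediate calls allocate nothing extra.
    public ConfigPropertySketch<T> build() {
      return new ConfigPropertySketch<>(key, defaultValue, alternatives, sinceVersion);
    }
  }
}
```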



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3351) Rebase Record combining semantic into `HoodieRecordCombiningEngine`

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3351:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Rebase Record combining semantic into `HoodieRecordCombiningEngine`
> ---
>
> Key: HUDI-3351
> URL: https://issues.apache.org/jira/browse/HUDI-3351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> From RFC-46:
> Extract Record Combining (Merge) API from `HoodieRecordPayload` into a 
> standalone, stateless component – `HoodieRecordCombiningEngine`.
> Such a component will be:
> 1. Abstracted as a stateless object providing an API to combine records (according 
> to predefined semantics) for the engines of interest (Spark, Flink)
> 2. A plug-in point for user-defined combination semantics
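To make the shape of such a component concrete, here is a hypothetical Java interface sketch of a stateless combining engine. The name, generic parameter, and method signature are assumptions for illustration, not the actual RFC-46 API.

```java
import java.util.Properties;

// Hypothetical sketch of a stateless record-combining component (not the actual RFC-46 API).
public interface RecordCombiningEngineSketch<R> {

  /**
   * Combines an incoming record with the record currently stored for the same key,
   * according to engine-agnostic, user-pluggable semantics (e.g. latest-wins, partial update).
   *
   * @return the record that should be persisted
   */
  R combine(R older, R newer, Properties props);
}
```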



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3385) Implement Spark-specific `FileReader`s

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3385:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Implement Spark-specific `FileReader`s
> --
>
> Key: HUDI-3385
> URL: https://issues.apache.org/jira/browse/HUDI-3385
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> To fully avoid using any intermediate representation (Avro), we will also have 
> to implement engine-specific `FileReader`s.
>  
> Initially, we will focus on Spark, with other engines to follow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3309) Integrate quickstart examples into integration tests

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3309:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Integrate quickstart examples into integration tests
> 
>
> Key: HUDI-3309
> URL: https://issues.apache.org/jira/browse/HUDI-3309
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, tests-ci
>Reporter: Raymond Xu
>Priority: Minor
> Fix For: 0.12.1
>
>
> - Create an integration test suite for the quickstart examples.
> - Generate the code examples on the website pages from that code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3410) Revisit Record-reading Abstractions

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3410:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Revisit Record-reading Abstractions
> ---
>
> Key: HUDI-3410
> URL: https://issues.apache.org/jira/browse/HUDI-3410
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently, while the logic of combining all the delta log files (into a set 
> of delta records) is unified across all query engines, the actual 
> merging with the base files is not.
> We need to revisit this and make sure:
>  * Record merging logic is shared across all query engines
>  * There is no duplication of merging logic (currently, merging log files and 
> base files is handled in completely isolated code paths)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3317) Partition specific pointed lookup/reading strategy for metadata table

2022-08-16 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3317:
--
Fix Version/s: 0.12.1
   (was: 0.12.0)

> Partition specific pointed lookup/reading strategy for metadata table
> -
>
> Key: HUDI-3317
> URL: https://issues.apache.org/jira/browse/HUDI-3317
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata, writer-core
>Reporter: Manoj Govindassamy
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.12.1
>
>
> Today inline reading can only be turned on for the entire metadata table, 
> meaning all partitions either have this feature enabled or none do. For smaller 
> partitions like "files", inline reading is not preferable, as it turns off the external 
> spillable map caching of records, whereas for other partitions like 
> bloom_filters, inline reading is preferred. We need a partition-specific inline 
> reading strategy for the metadata table.
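Purely as an illustration of what a per-partition strategy could look like, here is a small hypothetical Java sketch that maps each metadata partition name to a read mode, with a global default as fallback. The class, enum values, and partition-to-mode mapping are assumptions, not existing Hudi code or configuration.

```java
import java.util.Map;

// Hypothetical sketch: choose inline reading per metadata partition instead of globally.
public class MetadataReadStrategySketch {

  public enum ReadMode { INLINE, SPILLABLE_MAP }

  private final Map<String, ReadMode> perPartition;
  private final ReadMode defaultMode;

  public MetadataReadStrategySketch(Map<String, ReadMode> perPartition, ReadMode defaultMode) {
    this.perPartition = perPartition;
    this.defaultMode = defaultMode;
  }

  /** e.g. "files" -> SPILLABLE_MAP, "bloom_filters" -> INLINE, as suggested in the issue. */
  public ReadMode modeFor(String metadataPartition) {
    return perPartition.getOrDefault(metadataPartition, defaultMode);
  }
}
```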



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

