Re: [PR] [HUDI-7938] Broadcast `SerializableConfiguration` to avoid NullPointerException in Kryo SerDe [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11626:
URL: https://github.com/apache/hudi/pull/11626#issuecomment-2227217009

   
   ## CI report:
   
   * 256044ead7c3ab3a1c69f3fa46e36417965bb837 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24840)
 
   * 01da75c614c6a3a50a9ecca4e4a1ce315886355f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24852)
 
   
   
   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7938] Broadcast `SerializableConfiguration` to avoid NullPointerException in Kryo SerDe [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11626:
URL: https://github.com/apache/hudi/pull/11626#issuecomment-2227207604

   
   ## CI report:
   
   * 256044ead7c3ab3a1c69f3fa46e36417965bb837 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24840)
 
   * 01da75c614c6a3a50a9ecca4e4a1ce315886355f UNKNOWN
   
   





[jira] [Updated] (HUDI-7938) HadoopStorageConfiguration is not properly broadcasted with PySpark

2024-07-13 Thread Geser Dugarov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Geser Dugarov updated HUDI-7938:

Summary: HadoopStorageConfiguration is not properly broadcasted with 
PySpark  (was: Missed HoodieSparkKryoRegistrar in Hadoop config by default)

> HadoopStorageConfiguration is not properly broadcasted with PySpark
> ---
>
> Key: HUDI-7938
> URL: https://issues.apache.org/jira/browse/HUDI-7938
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Geser Dugarov
>Assignee: Geser Dugarov
>Priority: Major
>  Labels: pull-request-available
>
> HUDI-7567 added schema evolution to the filegroup reader (#10957),
> but broke integration with PySpark.
> When trying to call
> {quote}df_load = spark.read.format("org.apache.hudi").load(tmp_dir_path)
> df_load.collect()
> {quote}
> we got:
> {quote}24/06/28 11:22:06 WARN TaskSetManager: Lost task 1.0 in stage 27.0 (TID 31) (10.199.141.90 executor 0): java.lang.NullPointerException
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:73)
>     at org.apache.hudi.storage.hadoop.HadoopStorageConfiguration.unwrapCopy(HadoopStorageConfiguration.java:36)
>     at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:58)
>     at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(HoodieFileGroupReaderBasedParquetFileFormat.scala:197)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:891)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {quote}
> Spark 3.4.3 was used.
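The failure mode above, an executor-side object whose wrapped Hadoop `Configuration` comes back null after Kryo SerDe, can be sketched in miniature. The following is a hedged Python sketch, not Hudi code: `pickle` stands in for Kryo, and the class names are hypothetical analogues. It shows why a transient, non-serializable config field turns into a null on deserialization, and how serializing the underlying properties explicitly, in the spirit of Spark's `SerializableConfiguration`, avoids the NPE-equivalent:

```python
import pickle

# Stand-in for org.apache.hadoop.conf.Configuration: assume it is not
# meaningfully serializable, so naive serde loses its state entirely.
class HadoopConf:
    def __init__(self, props=None):
        self.props = props or {}

# Naive wrapper: the config field is excluded from the pickled state,
# mimicking a transient field that Kryo cannot carry across; it comes
# back as None on the deserializing (executor) side.
class NaiveStorageConfiguration:
    def __init__(self, conf):
        self.conf = conf
    def __getstate__(self):
        return {}          # conf is dropped, like a transient field
    def __setstate__(self, state):
        self.conf = None   # deserialized copy has a null config
    def unwrap_copy(self):
        # Fails on the deserialized copy, like the NullPointerException
        # in HadoopStorageConfiguration.unwrapCopy
        return HadoopConf(dict(self.conf.props))

# Fix in the spirit of the PR: serialize the *properties* explicitly and
# rebuild the Configuration on the deserializing side.
class SerializableStorageConfiguration:
    def __init__(self, conf):
        self.conf = conf
    def __getstate__(self):
        return {"props": self.conf.props}
    def __setstate__(self, state):
        self.conf = HadoopConf(state["props"])
    def unwrap_copy(self):
        return HadoopConf(dict(self.conf.props))

naive = pickle.loads(pickle.dumps(NaiveStorageConfiguration(HadoopConf({"k": "v"}))))
try:
    naive.unwrap_copy()
except AttributeError as e:  # Python's analogue of the NullPointerException
    print("naive copy failed:", e)

fixed = pickle.loads(pickle.dumps(SerializableStorageConfiguration(HadoopConf({"k": "v"}))))
print(fixed.unwrap_copy().props)  # {'k': 'v'}
```

In Spark itself, the analogous move is to wrap the `Configuration` in a serializable holder and broadcast it once, so every task deserializes a usable copy instead of a null field.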



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7938] Broadcast `SerializableConfiguration` to avoid NullPointerException in Kryo SerDe [hudi]

2024-07-13 Thread via GitHub


geserdugarov commented on PR #11626:
URL: https://github.com/apache/hudi/pull/11626#issuecomment-2227205668

   Changes in this PR shouldn't affect the results of the Flink-related CI
(`ITTestDataStreamWrite.testWriteMergeOnReadWithCompaction`). Restarted the CI
by rebasing and force-pushing.





[jira] [Updated] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class

2024-07-13 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7976:
-
Fix Version/s: 1.1.0

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
> ---
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> In the bugfix for HUDI-7955, the wrong class was used for invoking
> {{getTimestamp}}:
>  # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp
>  # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>  
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>  
> Submitting a bugfix to fix this bugfix...
> The log level for the exception block is also changed to warn so errors will
> be printed out.
> On top of that, we have simplified the {{getMillis}} shim, removing the method
> that was added in HUDI-7955, to standardise it with how {{getDays}} is written.
>  
>  
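The wrong-class mixup this Jira describes can be illustrated abstractly. Below is a minimal Python sketch with hypothetical stand-ins, not the actual Hive classes: it shows why calling a primitive-`Timestamp` style method on the serde wrapper (the `TimestampWritableV2` analogue) fails, and why the fix unwraps the writable first:

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the primitive Timestamp type.
class Timestamp:
    def __init__(self, dt):
        self.dt = dt
    def to_epoch_milli(self):
        return int(self.dt.timestamp() * 1000)

# Hypothetical stand-in for the serde wrapper (TimestampWritableV2 analogue),
# which is what the ObjectInspector actually hands over.
class TimestampWritableV2:
    def __init__(self, ts):
        self._ts = ts
    def get_timestamp(self):
        return self._ts

def get_millis_buggy(field_data):
    # Roughly HUDI-7955's mistake: treat the field as a primitive Timestamp.
    return field_data.to_epoch_milli()  # AttributeError on the writable

def get_millis_fixed(field_data):
    # Roughly HUDI-7976's fix: unwrap the writable first.
    return field_data.get_timestamp().to_epoch_milli()

writable = TimestampWritableV2(Timestamp(datetime(2024, 7, 13, tzinfo=timezone.utc)))
try:
    get_millis_buggy(writable)
except AttributeError as e:
    print("wrong class assumed:", e)
print(get_millis_fixed(writable))  # 1720828800000
```

The shim layer exists precisely so that this unwrapping logic can differ between Hive 2 (`TimestampWritable`) and Hive 3 (`TimestampWritableV2`) without leaking into callers.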





[jira] [Closed] (HUDI-7976) Fix BUG introduced in HUDI-7955 due to usage of wrong class

2024-07-13 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7976.

Resolution: Fixed

Fixed via master branch: 918c2e0009c054f9fcd4ca19ba3258c491483708

> Fix BUG introduced in HUDI-7955 due to usage of wrong class
> ---
>
> Key: HUDI-7976
> URL: https://issues.apache.org/jira/browse/HUDI-7976
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0
>
>
> In the bugfix for HUDI-7955, the wrong class was used for invoking
> {{getTimestamp}}:
>  # {*}Wrong{*}: org.apache.hadoop.hive.common.type.Timestamp
>  # {*}Correct{*}: org.apache.hadoop.hive.serde2.io.TimestampWritableV2
>  
> !https://git.garena.com/shopee/data-infra/hudi/uploads/eeff29b3e741c65eeb48f9901fa28da0/image.png|width=468,height=235!
>  
> Submitting a bugfix to fix this bugfix...
> The log level for the exception block is also changed to warn so errors will
> be printed out.
> On top of that, we have simplified the {{getMillis}} shim, removing the method
> that was added in HUDI-7955, to standardise it with how {{getDays}} is written.
>  
>  





(hudi) branch master updated: [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class (#11612)

2024-07-13 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 918c2e0009c [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage 
of wrong class (#11612)
918c2e0009c is described below

commit 918c2e0009c054f9fcd4ca19ba3258c491483708
Author: voonhous 
AuthorDate: Sun Jul 14 11:29:44 2024 +0800

[HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class 
(#11612)
---
 .../hudi/hadoop/utils/HiveAvroSerializer.java  |  3 +-
 .../apache/hudi/hadoop/utils/HoodieHiveUtils.java  |  8 +---
 .../apache/hudi/hadoop/utils/shims/Hive2Shim.java  |  9 +
 .../apache/hudi/hadoop/utils/shims/Hive3Shim.java  | 45 ++
 .../apache/hudi/hadoop/utils/shims/HiveShim.java   |  4 +-
 5 files changed, 26 insertions(+), 43 deletions(-)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
index 47d984c89c3..0c3362ba981 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HiveAvroSerializer.java
@@ -304,8 +304,7 @@ public class HiveAvroSerializer {
   case DATE:
 return HoodieHiveUtils.getDays(structFieldData);
   case TIMESTAMP:
-Object timestamp = HoodieHiveUtils.getTimestamp(structFieldData);
-return HoodieHiveUtils.getMills(timestamp);
+return HoodieHiveUtils.getMills(structFieldData);
   case INT:
 if (schema.getLogicalType() != null && 
schema.getLogicalType().getName().equals("date")) {
   return new 
WritableDateObjectInspector().getPrimitiveWritableObject(structFieldData).getDays();
diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
index ced39ccf379..b4894c35d41 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
@@ -182,15 +182,11 @@ public class HoodieHiveUtils {
 return HIVE_SHIM.getDateWriteable(value);
   }
 
-  public static Object getTimestamp(Object fieldData) {
-return HIVE_SHIM.unwrapTimestampAsPrimitive(fieldData);
-  }
-
   public static int getDays(Object dateWritable) {
 return HIVE_SHIM.getDays(dateWritable);
   }
 
-  public static long getMills(Object timestamp) {
-return HIVE_SHIM.getMills(timestamp);
+  public static long getMills(Object timestampWritable) {
+return HIVE_SHIM.getMills(timestampWritable);
   }
 }
diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive2Shim.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive2Shim.java
index e2a4f36cb7f..7f4b683d246 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive2Shim.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive2Shim.java
@@ -42,11 +42,6 @@ public class Hive2Shim implements HiveShim {
 return new TimestampWritable(timestamp);
   }
 
-  @Override
-  public Object unwrapTimestampAsPrimitive(Object o) {
-return o == null ? null : ((TimestampWritable) o).getTimestamp();
-  }
-
   public Writable getDateWriteable(int value) {
 return new DateWritable(value);
   }
@@ -55,7 +50,7 @@ public class Hive2Shim implements HiveShim {
 return ((DateWritable) dateWritable).getDays();
   }
 
-  public long getMills(Object timestamp) {
-return ((Timestamp) timestamp).getTime();
+  public long getMills(Object timestampWritable) {
+return ((TimestampWritable) timestampWritable).getTimestamp().getTime();
   }
 }
diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive3Shim.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive3Shim.java
index 9d6dca4f2b3..bc5b7b3e124 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive3Shim.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/shims/Hive3Shim.java
@@ -36,11 +36,12 @@ public class Hive3Shim implements HiveShim {
 
   public static final Logger LOG = LoggerFactory.getLogger(Hive3Shim.class);
 
-  public static final String HIVE_TIMESTAMP_TYPE_CLASS = 
"org.apache.hadoop.hive.common.type.Timestamp";
-  public static final String TIMESTAMP_WRITEABLE_V2_CLASS = 
"org.apache.hadoop.hive.serde2.io.TimestampWritableV2";
-  public static final String DATE_WRITEABLE_V2_CLASS = 
"org.apache.hadoop.hive.serde2.io.DateWritableV2";
+  public static final String TIMESTAMP_CLASS_NAME = 
"org.apache.hadoop.hive.common.type.Timestamp";
+  public static final String TIMESTAMP_WRITEABLE_V2_CLASS_NAME = 
"org.apache.hadoo

Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


danny0405 merged PR #11612:
URL: https://github.com/apache/hudi/pull/11612





Re: [PR] [DOCS] Release notes 1.0.0-beta2 [hudi]

2024-07-13 Thread via GitHub


codope commented on code in PR #11618:
URL: https://github.com/apache/hudi/pull/11618#discussion_r1676996631


##
website/docs/metadata.md:
##
@@ -90,6 +90,32 @@ Following are the different indices currently available 
under the metadata table
   Hudi release, this index aids in locating records faster than other existing 
indices and can provide a speedup orders of magnitude 
   faster in large deployments where index lookup dominates write latencies.
 
+#### New Indexes in 1.0.0
+
+- ***Functional Index***:
+  A [functional 
index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md)
+  is an index on a function of a column. If a query has a predicate on a 
function of a column, the functional index can
+  be used to speed up the query. The functional index is stored in *func_index_*
+  prefixed partitions (one for each function) under the metadata table. A
+  functional index can be created using SQL syntax. Please check out the SQL DDL
+  docs [here](/docs/next/sql_ddl#create-functional-index) for more details.
+
+- ***Partition Stats Index***
+  Partition stats index aggregates statistics at the partition level for the 
columns for which it is enabled. This helps
+  in efficient partition pruning even for non-partition fields. The partition 
stats index is stored in *partition_stats*
+  partition under metadata table. Partition stats index can be enabled using 
the following configs (note it is required
+  to specify the columns for which stats should be aggregated):
+  ```properties
+hoodie.metadata.index.partition.stats.enable=true
+hoodie.metadata.index.column.stats.columns=
+  ```
+  
+- ***Secondary Index***:
+  Secondary indexes allow users to create indexes on columns that are not part
+  of record key columns in Hudi tables (for record key fields, Hudi supports
+  [Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes
+  can be used to speed up queries with predicates on columns other than record
+  key columns.
+
+To try out these features, refer to the [SQL 
guide](/docs/next/sql_ddl#create-partition-stats-index).

Review Comment:
   Yes, I added it in the SQL guide; there is one section for partition stats
and secondary index combined.






Re: [PR] [DOCS] Release notes 1.0.0-beta2 [hudi]

2024-07-13 Thread via GitHub


codope commented on code in PR #11618:
URL: https://github.com/apache/hudi/pull/11618#discussion_r1676996508


##
website/releases/release-1.0.0-beta2.md:
##
@@ -0,0 +1,80 @@
+---
+title: "Release 1.0.0-beta2"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
1.0.0-beta2](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta2) 
([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta2 is the second beta release of Apache Hudi. This 
release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for 
production use.
+
+## Migration Guide
+
+This release contains major format changes as we will see in highlights below. 
We encourage users to try out the
+**1.0.0-beta2** features on new tables. The 1.0 general availability (GA) 
release will support automatic table upgrades
+from 0.x versions, while also ensuring full backward compatibility when 
reading 0.x Hudi tables using 1.0, ensuring a
+seamless migration experience.
+
+:::caution
+Given that the timeline format and log file format have changed in this **beta
+release**, it is recommended not to attempt rolling upgrades from older
+versions to this release.
+:::
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic 
covering all the format changes proposals,
+which are also partly covered in the [Hudi 1.0 tech 
specification](/tech-specs-1point0). The following are the main
+changes in this release:
+
+#### Timeline
+
+No major changes in this release. Refer to 
[1.0.0-beta1#timeline](release-1.0.0-beta1.md#timeline) for more details.
+
+#### Log File Format
+
+In addition to the fields in the log file header added in 
[1.0.0-beta1](release-1.0.0-beta1.md#log-file-format), we also
+store a flag, `IS_PARTIAL` to indicate whether the log block contains partial 
updates or not.
+
+### Metadata indexes
+
+In 1.0.0-beta1, we added support for functional index. In 1.0.0-beta2, we have 
added support for secondary indexes and
+partition stats index to the [multi-modal 
indexing](/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi)
 subsystem.
+
+#### Secondary Indexes
+
+Secondary indexes allow users to create indexes on columns that are not part of
+record key columns in Hudi tables (for record key fields, Hudi supports
+[Record-level Index](/blog/2023/11/01/record-level-index)). Secondary indexes
+can be used to speed up queries with predicates on columns other than record
+key columns.
+
+#### Partition Stats Index
+
+Partition stats index aggregates statistics at the partition level for the 
columns for which it is enabled. This helps
+in efficient partition pruning even for non-partition fields.
+
+To try out these features, refer to the [SQL 
guide](/docs/next/sql_ddl#create-partition-stats-index).
+
+### API Changes
+
+#### Positional Merging
+
+In 1.0.0-beta1, we added a new [filegroup reader](/releases/release-1.0.0-beta1#new-filegroup-reader).
+The reader now provides position-based merging, as an alternative to the
+existing key-based merging, and can skip pages based on record positions. The
+new filegroup reader is integrated with Spark and Hive, and enabled by default.
+To enable positional merging, set the configs below:
+
+```properties

Review Comment:
   We should enable it by default. I guess there are still a few gaps. I have
enabled it in https://github.com/apache/hudi/pull/11620 and am tracking failures.






Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-2227167064

   
   ## CI report:
   
   * 108e890a065a78c91d0bf28457b0bf2ec888e78b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24845)
 
   * 1e812dabbe90feeca9bd902654e92e1f8fc2de10 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24850)
 
   
   





Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-2227163520

   
   ## CI report:
   
   * 108e890a065a78c91d0bf28457b0bf2ec888e78b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24845)
 
   * 1e812dabbe90feeca9bd902654e92e1f8fc2de10 UNKNOWN
   
   





(hudi) branch asf-site updated: [DOCS] Add doc update for HUDI-7962 (#11622)

2024-07-13 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6ad04164dac [DOCS] Add doc update for HUDI-7962 (#11622)
6ad04164dac is described below

commit 6ad04164dac86b2ee805845d77d28dd246130e40
Author: houyuting 
AuthorDate: Sun Jul 14 09:08:04 2024 +0800

[DOCS] Add doc update for HUDI-7962 (#11622)

Co-authored-by: houyuting 
---
 website/docs/sql_ddl.md | 12 
 1 file changed, 12 insertions(+)

diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md
index a85d8a7bb04..eebadfc580e 100644
--- a/website/docs/sql_ddl.md
+++ b/website/docs/sql_ddl.md
@@ -496,6 +496,18 @@ SHOW PARTITIONS hudi_table;
 --Drop partition:
 ALTER TABLE hudi_table DROP PARTITION (dt='2021-12-09', hh='10');
 ```
+### Show create table
+
+**Syntax**
+
+```sql
+SHOW CREATE TABLE tableIdentifier;
+```
+
+**Examples**
+```sql
+SHOW CREATE TABLE hudi_table;
+```
 
 ### Caveats 
 



Re: [PR] [DOCS] Add doc update for HUDI-7962 [hudi]

2024-07-13 Thread via GitHub


danny0405 merged PR #11622:
URL: https://github.com/apache/hudi/pull/11622





[jira] [Closed] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-13 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7980.

Resolution: Fixed

Fixed via master branch: 98b3d3bac0f31219e5b93b7528516b27b87ea699

> Optimize the configuration content when performing clustering with row writer
> -
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, the row writer defaults to snapshot reads for all tables. However, 
> this method is relatively inefficient for MOR (Merge on Read) tables when 
> there are no logs. Therefore, we should optimize this part of the 
> configuration.





[jira] [Updated] (HUDI-7980) Optimize the configuration content when performing clustering with row writer

2024-07-13 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7980:
-
Fix Version/s: 1.0.0

> Optimize the configuration content when performing clustering with row writer
> -
>
> Key: HUDI-7980
> URL: https://issues.apache.org/jira/browse/HUDI-7980
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, the row writer defaults to snapshot reads for all tables. However, 
> this method is relatively inefficient for MOR (Merge on Read) tables when 
> there are no logs. Therefore, we should optimize this part of the 
> configuration.





Re: [PR] [HUDI-7980] Optimize the configuration content when performing clustering with row writer [hudi]

2024-07-13 Thread via GitHub


danny0405 merged PR #11614:
URL: https://github.com/apache/hudi/pull/11614





(hudi) branch master updated: [HUDI-7980] Optimize the configuration content when performing clustering with row writer (#11614)

2024-07-13 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 98b3d3bac0f [HUDI-7980] Optimize the configuration content when 
performing clustering with row writer (#11614)
98b3d3bac0f is described below

commit 98b3d3bac0f31219e5b93b7528516b27b87ea699
Author: majian <47964462+majian1...@users.noreply.github.com>
AuthorDate: Sun Jul 14 09:06:37 2024 +0800

[HUDI-7980] Optimize the configuration content when performing clustering 
with row writer (#11614)
---
 .../run/strategy/MultipleSparkJobExecutionStrategy.java   | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
index 75b42491eda..47ccd8700a8 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
@@ -93,7 +93,6 @@ import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
 import static 
org.apache.hudi.client.utils.SparkPartitionUtils.getPartitionFieldVals;
-import static org.apache.hudi.common.config.HoodieCommonConfig.TIMESTAMP_AS_OF;
 import static 
org.apache.hudi.config.HoodieClusteringConfig.PLAN_STRATEGY_SORT_COLUMNS;
 import static 
org.apache.hudi.io.storage.HoodieSparkIOFactory.getHoodieSparkIOFactory;
 
@@ -438,8 +437,11 @@ public abstract class MultipleSparkJobExecutionStrategy
 .toArray(StoragePath[]::new);
 
 HashMap params = new HashMap<>();
-params.put("hoodie.datasource.query.type", "snapshot");
-params.put(TIMESTAMP_AS_OF.key(), instantTime);
+if (hasLogFiles) {
+  params.put("hoodie.datasource.query.type", "snapshot");
+} else {
+  params.put("hoodie.datasource.query.type", "read_optimized");
+}
 
 StoragePath[] paths;
 if (hasLogFiles) {



Re: [I] When querying a Hudi partitioned table with Hive SQL, if the partition field is not the last column of the table, the data returned after parsing the parquet files has the partition field's value automatically inserted at the partition column's position even though the partition field was not queried, causing type-conversion errors in subsequent columns [hudi]

2024-07-13 Thread via GitHub


danny0405 commented on issue #11609:
URL: https://github.com/apache/hudi/issues/11609#issuecomment-2227155521

   Hive is a legacy repo, and I don't think forcing the partition fields to the
end of the schema is the right behavior to follow.
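For illustration, the positioning problem the issue describes can be sketched as follows. This is a hedged, hypothetical helper, not Hudi or Hive code: partition values have to be spliced into the row at the partition columns' schema positions rather than appended after the file's data columns, otherwise every column after the partition field shifts and hits type-conversion errors.

```python
def assemble_row(schema, data_cols, file_row, partition_values):
    """Rebuild a full row in schema order.

    schema: ordered column names of the table.
    data_cols: columns actually present in the parquet file.
    file_row: values read from the file, aligned with data_cols.
    partition_values: {partition_col: value} derived from the partition path.
    """
    file_vals = dict(zip(data_cols, file_row))
    # Place each value at its schema position; partition columns are filled
    # from the partition path, everything else from the file.
    return [file_vals[c] if c in file_vals else partition_values[c] for c in schema]

# Partition column 'dt' is NOT the last column of the schema:
schema = ["id", "dt", "name"]
row = assemble_row(schema, ["id", "name"], [1, "alice"], {"dt": "2024-07-13"})
print(row)  # [1, '2024-07-13', 'alice']
```

Appending the partition value at the end instead would yield `[1, "alice", "2024-07-13"]`, putting a string where `name` is expected and a date where an id-adjacent column is expected, which matches the symptom reported in the issue.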





Re: [PR] [HUDI-7986] Fix Duplicate handling behavior when Precombine value is not set [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11630:
URL: https://github.com/apache/hudi/pull/11630#issuecomment-2227123686

   
   ## CI report:
   
   * 544cb739d5fe30a5af0279a85d198167a85d0baf Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24848)
 
   
   





Re: [PR] [HUDI-7986] Fix Dupes behavior with Precombine [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11630:
URL: https://github.com/apache/hudi/pull/11630#issuecomment-2227103064

   
   ## CI report:
   
   * 544cb739d5fe30a5af0279a85d198167a85d0baf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24848)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7986] Fix Dupes behavior with Precombine [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11630:
URL: https://github.com/apache/hudi/pull/11630#issuecomment-2227101414

   
   ## CI report:
   
   * 544cb739d5fe30a5af0279a85d198167a85d0baf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7986) Make precombine field optional with Dedup feature for Mutable Streams

2024-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7986:
-
Labels: pull-request-available  (was: )

> Make precombine field optional with Dedup feature for Mutable Streams
> -
>
> Key: HUDI-7986
> URL: https://issues.apache.org/jira/browse/HUDI-7986
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sivaguru Kannan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7986] Fix Dupes behavior with Precombine [hudi]

2024-07-13 Thread via GitHub


csivaguru opened a new pull request, #11630:
URL: https://github.com/apache/hudi/pull/11630

   Opening a draft PR for OSS fix.





[jira] [Created] (HUDI-7986) Make precombine field optional with Dedup feature for Mutable Streams

2024-07-13 Thread Sivaguru Kannan (Jira)
Sivaguru Kannan created HUDI-7986:
-

 Summary: Make precombine field optional with Dedup feature for 
Mutable Streams
 Key: HUDI-7986
 URL: https://issues.apache.org/jira/browse/HUDI-7986
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Sivaguru Kannan








Re: [PR] [HUDI-7985] Support more formats in timestamp logical types in Json Avro converter [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11629:
URL: https://github.com/apache/hudi/pull/11629#issuecomment-2227076436

   
   ## CI report:
   
   * 1f582f381e89945bce7b5b97e33fc2e66c0d7b5f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24847)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-2227076299

   
   ## CI report:
   
   * 108e890a065a78c91d0bf28457b0bf2ec888e78b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24845)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT]Failed to update metadata(hudi 0.15.0) [hudi]

2024-07-13 Thread via GitHub


MrAladdin commented on issue #11587:
URL: https://github.com/apache/hudi/issues/11587#issuecomment-2227070835

   @nsivabalan @ad1happy2go @danny0405 @codope  A large number of 
metadata-related jobs are missing from the Spark UI due to an abnormal 
deltacommit that has been stuck in the INFLIGHT state. Furthermore, 
running the command `commit showfiles --commit exception_deltacommit_id` does not 
reveal any file write information. Additionally, the `.hoodie/.temp/` directory 
contains folders and data corresponding to these exception_deltacommit_ids.
   
   
   
   exception :
   
![Image](https://github.com/user-attachments/assets/e90f410d-89b9-40cb-b0d0-465f30c7ce57)
   
   normal :
   
![Image](https://github.com/user-attachments/assets/fde4ce58-0d4e-4ee4-8f56-c6688c0175f6)
   





Re: [PR] [MINOR] fix the target location for auxlib download in hudi CLI [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11628:
URL: https://github.com/apache/hudi/pull/11628#issuecomment-2227047795

   
   ## CI report:
   
   * d57782938c183d8d9ba3039e4be32c3a284fc89e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24846)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] docs: add release guide [hudi-rs]

2024-07-13 Thread via GitHub


codecov[bot] commented on PR #66:
URL: https://github.com/apache/hudi-rs/pull/66#issuecomment-2227038878

   ## 
[Codecov](https://app.codecov.io/gh/apache/hudi-rs/pull/66?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 87.19%. Comparing base 
[(`2c59bf1`)](https://app.codecov.io/gh/apache/hudi-rs/commit/2c59bf100c5e77df002edecb2bef8defaa5f209e?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`1c59693`)](https://app.codecov.io/gh/apache/hudi-rs/commit/1c596939be9ac6e3e2e8973e61bc14d670d6032d?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   Additional details and impacted files
   
   
   ```diff
    @@           Coverage Diff           @@
    ##             main      #66   +/-   ##
    =======================================
      Coverage   87.19%   87.19%
    =======================================
      Files          13       13
      Lines         687      687
    =======================================
      Hits          599      599
      Misses         88       88
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/hudi-rs/pull/66?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   





[PR] docs: add release guide [hudi-rs]

2024-07-13 Thread via GitHub


xushiyan opened a new pull request, #66:
URL: https://github.com/apache/hudi-rs/pull/66

   (no comment)





Re: [PR] [MINOR] fix the target location for auxlib download in hudi CLI [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11628:
URL: https://github.com/apache/hudi/pull/11628#issuecomment-2227033491

   
   ## CI report:
   
   * d57782938c183d8d9ba3039e4be32c3a284fc89e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24846)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7985] Support more formats in timestamp logical types in Json Avro converter [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11629:
URL: https://github.com/apache/hudi/pull/11629#issuecomment-2227033499

   
   ## CI report:
   
   * 1f582f381e89945bce7b5b97e33fc2e66c0d7b5f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24847)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-2227033459

   
   ## CI report:
   
   * 6ceca16530ca218d73a2624c18b09bd07b28b116 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24824)
 
   * 108e890a065a78c91d0bf28457b0bf2ec888e78b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24845)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7985] Support more formats in timestamp logical types in Json Avro converter [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11629:
URL: https://github.com/apache/hudi/pull/11629#issuecomment-2227015816

   
   ## CI report:
   
   * 1f582f381e89945bce7b5b97e33fc2e66c0d7b5f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] fix the target location for auxlib download in hudi CLI [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11628:
URL: https://github.com/apache/hudi/pull/11628#issuecomment-2227015741

   
   ## CI report:
   
   * d57782938c183d8d9ba3039e4be32c3a284fc89e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7976] Fix BUG introduced in HUDI-7955 due to usage of wrong class [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11612:
URL: https://github.com/apache/hudi/pull/11612#issuecomment-2227015499

   
   ## CI report:
   
   * 6ceca16530ca218d73a2624c18b09bd07b28b116 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24824)
 
   * 108e890a065a78c91d0bf28457b0bf2ec888e78b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] hudi-common 0.14.0 jar in mavenCentral appears to have corrupt generated avro classes [hudi]

2024-07-13 Thread via GitHub


lucasmo commented on issue #11602:
URL: https://github.com/apache/hudi/issues/11602#issuecomment-2227007832

   Here is a reproducer script:
   ```bash
   #!/usr/bin/env bash
   # Download hudi-common 0.14.0 and its minimal runtime dependencies from Maven Central.
   MAVEN="https://repo1.maven.org/maven2"

   ARTIFACTS="\
   org/apache/avro/avro/1.11.3/avro-1.11.3.jar \
   com/fasterxml/jackson/core/jackson-core/2.17.1/jackson-core-2.17.1.jar \
   com/fasterxml/jackson/core/jackson-databind/2.17.1/jackson-databind-2.17.1.jar \
   com/fasterxml/jackson/core/jackson-annotations/2.17.1/jackson-annotations-2.17.1.jar \
   org/slf4j/slf4j-api/2.0.9/slf4j-api-2.0.9.jar \
   org/apache/hudi/hudi-common/0.14.0/hudi-common-0.14.0.jar \
   "

   CLASSPATH=""

   # Fetch each jar and build up the classpath.
   for artifact in $ARTIFACTS; do
     curl -O "${MAVEN}/${artifact}"
     jar=$(basename "$artifact")
     CLASSPATH="${CLASSPATH}:${jar}"
   done

   echo "$CLASSPATH"

   # Parse the HoodieCleanPartitionMetadata schema and ask Avro's SpecificData
   # which generated class it resolves to.
   echo 'org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"HoodieCleanPartitionMetadata\",\"namespace\":\"org.apache.hudi.avro.model\",\"fields\":[{\"name\":\"partitionPath\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"policy\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"deletePathPatterns\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"successDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"failedDeleteFiles\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}},{\"name\":\"isPartitionDeleted\",\"type\":[\"null\",\"boolean\"],\"default\":null}]}"); System.out.println("Class for schema: " + org.apache.avro.specific.SpecificData.get().getClass(schema));' | jshell --class-path "${CLASSPATH}"
   ```





[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Description: 
Following error is thrown when using Json Kafka Source with transformer and 
decimal is in the schema:



 

 

We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use \{{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support zone offset and space character as the separator (which is backwards 
compatible), instead of introducing a new config of format (assuming that 
common use cases just have space character as the variant). 

  was:
We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use \{{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support zone offset and space character as the separator (which is backwards 
compatible), instead of introducing a new config of format (assuming that 
common use cases just have space character as the variant). 


> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Following error is thrown when using Json Kafka Source with transformer and 
> decimal is in the schema:
>  
>  
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use \{{ }} (space) instead of {{T}} as the 
> separation (other parts are the same).  References indicate that ISO-8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateT

[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Description: 
Following error is thrown when using Json Kafka Source with transformer and 
decimal is in the schema:
{code:java}
Caused by: Json to Avro Type conversion error for field loaded_at, 2024-06-03 
13:42:34.951+00:00 for {"type":"long","logicalType":"timestamp-millis"}
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessorUtil$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:194)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessorUtil.convertToAvro(MercifulJsonConverter.java:204)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:182)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:126)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:107)
at 
org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJson(AvroConvertor.java:118)
... 43 more {code}
We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use \{{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support zone offset and space character as the separator (which is backwards 
compatible), instead of introducing a new config of format (assuming that 
common use cases just have space character as the variant). 

  was:
Following error is thrown when using Json Kafka Source with transformer and 
decimal is in the schema:



 

 

We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use \{{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support zone offset and space character as the separator (which is backwards 
compatible), instead of introducing a new config of format (assuming that 
common use cases just have space character as the variant). 


> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>  

[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7985:
-
Labels: pull-request-available  (was: )

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use \{{ }} (space) instead of {{T}} as the 
> separation (other parts are the same).  References indicate that ISO-8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse 
> timestamps like {{2024-05-13T23:53:36.004Z}} , already supported in 
> {{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
> with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
> space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a 
> simple twist of the formatter, it can be easily supported.
> My take is we should change the formatter of the timestamp logical types to 
> support zone offset and space character as the separator (which is backwards 
> compatible), instead of introducing a new config of format (assuming that 
> common use cases just have space character as the variant). 



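The "simple twist" of the formatter described in this issue can be sketched with `java.time.format.DateTimeFormatterBuilder`: make the separator between the date and time parts optional so that either `T` or a space is accepted, while still parsing zone offsets (including `Z`). This is an illustrative sketch only; the class name `FlexibleTimestampParser` is hypothetical and not part of Hudi's `MercifulJsonConverter`.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

public class FlexibleTimestampParser {

    // ISO-8601 offset date-time, but with the separator relaxed:
    // either 'T' or a single space may separate the date and time parts.
    private static final DateTimeFormatter FLEXIBLE_ISO_OFFSET_DATE_TIME =
        new DateTimeFormatterBuilder()
            .append(DateTimeFormatter.ISO_LOCAL_DATE)
            .optionalStart().appendLiteral('T').optionalEnd()
            .optionalStart().appendLiteral(' ').optionalEnd()
            .append(DateTimeFormatter.ISO_LOCAL_TIME)
            .appendOffset("+HH:MM", "Z")   // accepts +01:00, +00:00, and Z
            .toFormatter();

    public static OffsetDateTime parse(String timestamp) {
        return OffsetDateTime.parse(timestamp, FLEXIBLE_ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(parse("2024-05-13T23:53:36.004Z").toInstant());
        System.out.println(parse("2024-06-03 13:42:34.951+00:00").toInstant());
    }
}
```

With such a formatter, `2024-05-13T23:53:36.004Z`, `2011-12-03T10:15:30+01:00`, and `2024-06-03 13:42:34.951+00:00` all parse successfully, which matches the backwards-compatible behavior proposed above.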


[PR] [HUDI-7985] Support more formats in timestamp logical types in Json Avro converter [hudi]

2024-07-13 Thread via GitHub


yihua opened a new pull request, #11629:
URL: https://github.com/apache/hudi/pull/11629

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[PR] fix the target location for auxlib download in hudi CLI [hudi]

2024-07-13 Thread via GitHub


prabodh1194 opened a new pull request, #11628:
URL: https://github.com/apache/hudi/pull/11628

   ### Change Logs
   
   Using the Hudi CLI to access tables on S3 has some limitations, as the relevant 
`hadoop` jars are not on the path by default. I have updated the CLI 
utility to facilitate adding the hadoop S3 jars as well. For compatibility 
purposes, I have put this facility behind a flag called `IS_S3_ENABLED`, which 
can be set to `true`. Enabling this flag will add the hadoop jars to the 
`auxlib` as well.
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   This page can be updated to highlight that the flag can be set to access 
`s3a` buckets path:
   https://hudi.apache.org/docs/next/cli/#using-hudi-cli
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [DOCS] Release notes 1.0.0-beta2 [hudi]

2024-07-13 Thread via GitHub


nsivabalan commented on code in PR #11618:
URL: https://github.com/apache/hudi/pull/11618#discussion_r1676856952


##
website/releases/release-1.0.0-beta2.md:
##
@@ -0,0 +1,80 @@
+---
+title: "Release 1.0.0-beta2"
+sidebar_position: 1
+layout: releases
+toc: true
+---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## [Release 
1.0.0-beta2](https://github.com/apache/hudi/releases/tag/release-1.0.0-beta2) 
([docs](/docs/next/quick-start-guide))
+
+Apache Hudi 1.0.0-beta2 is the second beta release of Apache Hudi. This 
release is meant for early adopters to try
+out the new features and provide feedback. The release is not meant for 
production use.
+
+## Migration Guide
+
+This release contains major format changes as we will see in highlights below. 
We encourage users to try out the
+**1.0.0-beta2** features on new tables. The 1.0 general availability (GA) 
release will support automatic table upgrades
+from 0.x versions, while also ensuring full backward compatibility when 
reading 0.x Hudi tables using 1.0, for a
+seamless migration experience.
+
+:::caution
+Given that the timeline format and log file format have changed in this **beta 
release**, it is recommended not to attempt 
rolling upgrades from older versions to this release.
+:::
+
+## Highlights
+
+### Format changes
+
+[HUDI-6242](https://issues.apache.org/jira/browse/HUDI-6242) is the main epic 
covering all the format change proposals,
+which are also partly covered in the [Hudi 1.0 tech 
specification](/tech-specs-1point0). The following are the main
+changes in this release:
+
+ Timeline
+
+No major changes in this release. Refer to 
[1.0.0-beta1#timeline](release-1.0.0-beta1.md#timeline) for more details.
+
+ Log File Format
+
+In addition to the fields in the log file header added in 
[1.0.0-beta1](release-1.0.0-beta1.md#log-file-format), we also
+store a flag, `IS_PARTIAL` to indicate whether the log block contains partial 
updates or not.
+
+### Metadata indexes
+
+In 1.0.0-beta1, we added support for functional indexes. In 1.0.0-beta2, we have 
added support for secondary indexes and the
+partition stats index in the [multi-modal 
indexing](/blog/2022/05/17/Introducing-Multi-Modal-Index-for-the-Lakehouse-in-Apache-Hudi)
 subsystem.
+
+ Secondary Indexes
+
+Secondary indexes allow users to create indexes on columns that are not part 
of the record key columns in Hudi tables (for 
+record key fields, Hudi supports the [Record-level 
Index](/blog/2023/11/01/record-level-index)). Secondary indexes can be used to 
speed up
+queries with predicates on columns other than record key columns.
+
+ Partition Stats Index
+
+Partition stats index aggregates statistics at the partition level for the 
columns for which it is enabled. This helps
+in efficient partition pruning even for non-partition fields.
+
+To try out these features, refer to the [SQL 
guide](/docs/next/sql_ddl#create-partition-stats-index).
+
+### API Changes
+
+ Positional Merging
+
+In 1.0.0-beta1, we added a new [filegroup 
reader](/releases/release-1.0.0-beta1#new-filegroup-reader). The reader now
+provides position-based merging, as an alternative to the existing key-based 
merging, and supports skipping pages based on record
+positions. The new filegroup reader is integrated with Spark and Hive, and is 
enabled by default. To enable positional
+merging, set the configs below:
+
+```properties

Review Comment:
   Not related to this doc PR; curious in general:
   if we have a fallback mechanism to do key-based merges when positional 
merges are not possible, why don't we enable this by default? 



##
website/docs/metadata.md:
##
@@ -90,6 +90,32 @@ Following are the different indices currently available 
under the metadata table
   Hudi release, this index aids in locating records faster than other existing 
indices and can provide a speedup orders of magnitude 
   faster in large deployments where index lookup dominates write latencies.
 
+ New Indexes in 1.0.0
+
+- ***Functional Index***:
+  A [functional 
index](https://github.com/apache/hudi/blob/3789840be3d041cbcfc6b24786740210e4e6d6ac/rfc/rfc-63/rfc-63.md)
+  is an index on a function of a column. If a query has a predicate on a 
function of a column, the functional index can
+  be used to speed up the query. A functional index is stored in *func_index_* 
prefixed partitions (one for each
+  function) under the metadata table. A functional index can be created using 
SQL syntax. Please check out the SQL DDL
+  docs [here](/docs/next/sql_ddl#create-functional-index) for more details.
+
+- ***Partition Stats Index***
+  The partition stats index aggregates statistics at the partition level for the 
columns for which it is enabled. This helps
+  in efficient partition pruning even for non-partition fields. The partition 
stats index is stored in the *partition_stats*
+  partition under the metadata table. The partition stats index can be enabled using 
the following configs (note it is require

(hudi) branch master updated: [MINOR] Update DOAP with 1.0.0-beta2 Release (#11627)

2024-07-13 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new afc91515983 [MINOR] Update DOAP with 1.0.0-beta2 Release (#11627)
afc91515983 is described below

commit afc91515983badd91f671c14f3737fe034d96b9c
Author: Sagar Sumit 
AuthorDate: Sat Jul 13 22:25:34 2024 +0530

[MINOR] Update DOAP with 1.0.0-beta2 Release (#11627)
---
 doap_HUDI.rdf | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/doap_HUDI.rdf b/doap_HUDI.rdf
index 981b2619fb1..1f2b45a4899 100644
--- a/doap_HUDI.rdf
+++ b/doap_HUDI.rdf
@@ -189,6 +189,13 @@
         <revision>0.15.0</revision>
       </Version>
     </release>
+    <release>
+      <Version>
+        <name>Apache Hudi 1.0.0-beta2</name>
+        <created>2024-07-14</created>
+        <revision>1.0.0-beta2</revision>
+      </Version>
+    </release>
     <repository>
       <GitRepository>
         <location rdf:resource="https://github.com/apache/hudi.git"/>



Re: [PR] [MINOR] Update DOAP with 1.0.0-beta2 Release [hudi]

2024-07-13 Thread via GitHub


yihua merged PR #11627:
URL: https://github.com/apache/hudi/pull/11627





Re: [PR] [MINOR] Update DOAP with 1.0.0-beta2 Release [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11627:
URL: https://github.com/apache/hudi/pull/11627#issuecomment-2226994880

   
   ## CI report:
   
   * 2d98b0b15bc44671cb1087f955a301854308cd9e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Description: 
We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use \{{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support zone offset and space character as the separator (which is backwards 
compatible), instead of introducing a new config of format (assuming that 
common use cases just have space character as the variant). 

  was:
We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use {{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support the zone offset and the space character as the separator (which is 
backwards compatible), instead of introducing a new format config (for 
mitigating this incident there is no such need, and assuming that common use 
cases just have the space character as the variant). 


> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use \{{ }} (space) instead of {{T}} as the 
> separation (other parts are the same).  References indicate that ISO-8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse 
> timestamps like {{2024-05-13T23:53:36.004Z}} , already supported in 
> {{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
> with zone offset (which is not supported in {{MercifulJsonConverter}} yet)

[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Description: 
We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in timestamp 
logical type.
 * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
{{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
 * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
separation character
 * There are systems that use {{ }} (space) instead of {{T}} as the separation 
(other parts are the same).  References indicate that ISO-8601 used to allow 
this by _mutual agreement_ 
([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
 
[ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse timestamps 
like {{2024-05-13T23:53:36.004Z}} , already supported in 
{{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
 * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a simple 
twist of the formatter, it can be easily supported.

My take is we should change the formatter of the timestamp logical types to 
support the zone offset and the space character as the separator (which is 
backwards compatible), instead of introducing a new format config (for 
mitigating this incident there is no such need, and assuming that common use 
cases just have the space character as the variant). 

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use {{ }} (space) instead of {{T}} as the 
> separation (other parts are the same).  References indicate that ISO-8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse 
> timestamps like {{2024-05-13T23:53:36.004Z}} , already supported in 
> {{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
> with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
> space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a 
> simple twist of the formatter, it can be easily supported.
> My take is we should change the formatter of the timestamp logical types to 
> support the zone offset and the space character as the separator (which is 
> backwards compatible), instead of introducing a new format config (for 
> mitigating this incident there is no such need, and assuming that common use 
> cases just have the space character as the variant). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
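The "simple twist of the formatter" described in HUDI-7985 above can be sketched as follows. This is an illustrative example only, not Hudi's actual `MercifulJsonConverter` implementation; the class name `FlexibleTimestampParser` and the exact builder arrangement are assumptions:

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;

public class FlexibleTimestampParser {
    // A formatter equivalent to ISO_OFFSET_DATE_TIME, except that it accepts
    // either 'T' or a space between the date and time parts. The offset part
    // accepts "+HH:MM"-style offsets as well as "Z" (zero offset).
    private static final DateTimeFormatter FLEXIBLE_ISO_OFFSET =
        new DateTimeFormatterBuilder()
            .append(DateTimeFormatter.ISO_LOCAL_DATE)
            .optionalStart().appendLiteral('T').optionalEnd()   // 'T' separator (ISO 8601)
            .optionalStart().appendLiteral(' ').optionalEnd()   // space separator variant
            .append(DateTimeFormatter.ISO_LOCAL_TIME)
            .appendOffset("+HH:MM", "Z")
            .toFormatter();

    public static OffsetDateTime parse(String s) {
        return OffsetDateTime.parse(s, FLEXIBLE_ISO_OFFSET);
    }

    public static void main(String[] args) {
        // All three variants from the issue description parse successfully:
        System.out.println(parse("2011-12-03T10:15:30+01:00"));
        System.out.println(parse("2011-12-03 10:15:30+01:00"));
        System.out.println(parse("2024-06-03 13:42:34.951+00:00"));
    }
}
```

Because both separators are modeled as optional sections around the same local-time parser, timestamps with `T` keep parsing exactly as before, making the change backwards compatible as argued above.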


[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Fix Version/s: 1.0.0

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> We need to make sure "2024-06-03 13:42:34.951+00:00" is supported in 
> timestamp logical type.
>  * ISO 8601 supports the zone offset in the standard, e.g., {{+01:00}} , and 
> {{Z}} is the zone offset equivalent to {{+00:00}} or UTC 
> ([ref1|https://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators])
>  * {{2011-12-03T10:15:30+01:00}} conforms to ISO 8601 with {{T}} as the 
> separation character
>  * There are systems that use \{{ }} (space) instead of {{T}} as the 
> separation (other parts are the same).  References indicate that ISO-8601 
> used to allow this by _mutual agreement_ 
> ([ref2|https://stackoverflow.com/questions/30201003/how-to-deal-with-optional-t-in-iso-8601-timestamp-in-java-8-jsr-310-threet],
>  
> [ref3|https://www.reddit.com/r/ISO8601/comments/173r61j/t_vs_space_separation_of_date_and_time/])
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} can successfully parse 
> timestamps like {{2024-05-13T23:53:36.004Z}} , already supported in 
> {{{}MercifulJsonConverter{}}}, and additionally {{2011-12-03T10:15:30+01:00}} 
> with zone offset (which is not supported in {{MercifulJsonConverter}} yet)
>  * {{DateTimeFormatter.ISO_OFFSET_DATE_TIME}} cannot parse the timestamp with 
> space as the separator, like {{2011-12-03 10:15:30+01:00}} .  But with a 
> simple twist of the formatter, it can be easily supported.
> My take is we should change the formatter of the timestamp logical types to 
> support zone offset and space character as the separator (which is backwards 
> compatible), instead of introducing a new config of format (assuming that 
> common use cases just have space character as the variant). 





[jira] [Assigned] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7985:
---

Assignee: Ethan Guo

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Updated] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7985:

Status: In Progress  (was: Open)

> Support more formats in timestamp logical types in Json Avro converter
> --
>
> Key: HUDI-7985
> URL: https://issues.apache.org/jira/browse/HUDI-7985
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>






[jira] [Created] (HUDI-7985) Support more formats in timestamp logical types in Json Avro converter

2024-07-13 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7985:
---

 Summary: Support more formats in timestamp logical types in Json 
Avro converter
 Key: HUDI-7985
 URL: https://issues.apache.org/jira/browse/HUDI-7985
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








[PR] [MINOR] Update DOAP with 1.0.0-beta2 Release [hudi]

2024-07-13 Thread via GitHub


codope opened a new pull request, #11627:
URL: https://github.com/apache/hudi/pull/11627

   ### Change Logs
   
   This PR updates DOAP with 1.0.0-beta2 Release for record keeping.
   
   ### Impact
   
   Publish new release version
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Closed] (HUDI-1698) Multiwriting for Flink / Java

2024-07-13 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-1698.
-
Resolution: Fixed

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, writer-core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0-beta2, 1.0.0
>
>






[jira] [Closed] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-13 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7950.
-
Resolution: Fixed

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>
> We should unify the shading rule of roaring bitmap dependency in the root POM 
> for consistency among bundles.





(hudi) branch master updated (e2860cddf54 -> 11304fd93ba)

2024-07-13 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e2860cddf54 [HUDI-7709] Pass partition paths as partition column 
values if `TimestampBasedKeyGenerator` is used (#11615)
 add 11304fd93ba [HUDI-7950] Shade roaring bitmap dependency in root POM 
(#11561)

No new revisions were added by this update.

Summary of changes:
 packaging/hudi-spark-bundle/pom.xml | 5 -
 packaging/hudi-utilities-bundle/pom.xml | 5 -
 pom.xml | 6 ++
 3 files changed, 6 insertions(+), 10 deletions(-)



Re: [PR] [HUDI-7950] Shade roaring bitmap dependency in root POM [hudi]

2024-07-13 Thread via GitHub


codope merged PR #11561:
URL: https://github.com/apache/hudi/pull/11561





[jira] [Updated] (HUDI-7928) Fix shared HFile reader in HoodieNativeAvroHFileReader

2024-07-13 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7928:
--
Fix Version/s: (was: 1.0.0-beta2)

> Fix shared HFile reader in HoodieNativeAvroHFileReader
> --
>
> Key: HUDI-7928
> URL: https://issues.apache.org/jira/browse/HUDI-7928
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> The shared HFile reader in HoodieNativeAvroHFileReader uses significant 
> memory for reading meta info from the HFile.  We should avoid keeping the 
> reference to the shared HFile reader and cache the meta info only.





[jira] [Updated] (HUDI-7950) Shade roaring bitmap dependency in root POM

2024-07-13 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7950:
--
Fix Version/s: (was: 1.0.0-beta2)

> Shade roaring bitmap dependency in root POM
> ---
>
> Key: HUDI-7950
> URL: https://issues.apache.org/jira/browse/HUDI-7950
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>
> We should unify the shading rule of roaring bitmap dependency in the root POM 
> for consistency among bundles.





svn commit: r70286 - in /release/hudi/1.0.0-beta2: ./ hudi-1.0.0-beta2.src.tgz hudi-1.0.0-beta2.src.tgz.asc hudi-1.0.0-beta2.src.tgz.sha512

2024-07-13 Thread codope
Author: codope
Date: Sat Jul 13 15:46:12 2024
New Revision: 70286

Log:
Adding source release for version 1.0.0-beta2

Added:
release/hudi/1.0.0-beta2/
release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz   (with props)
release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc
release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512

Added: release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz
==
Binary file - no diff available.

Propchange: release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz
--
svn:mime-type = application/octet-stream

Added: release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc
==
--- release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc (added)
+++ release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.asc Sat Jul 13 15:46:12 
2024
@@ -0,0 +1,14 @@
+-BEGIN PGP SIGNATURE-
+
+iQGzBAABCAAdFiEE/SFTQuMZlBmt+/Qd1GI+OqFtdbAFAmaR+iwACgkQ1GI+OqFt
+dbAkywv8Ct3Nms+FpvKuXIv+0pFx+sw9264H4JJKK5ONtb25Rg4Dg7+OcZDVc8Q6
+KTdS68Ulf95fUQSuJmH+F9Lr6kRBjy478KPe9w4WswDd3b17gQeT9RHmUsRxYfY1
+w9CAS1bEkPeZyases+d4AebYpaoEEB3PZJ+9zXFbBts2GxwtGx4/m32qdJVdkqX6
+mUAWXUKg9eo8skOK78QFkopLqm1/yP/JOnLNG7uJ4X8j1pfXzr0e0ACvpbkb+UJu
+nTTA4TJ9iBuPnV1GeF4kZlsKjRpep+qPrOKlGXNPl7nJOVZ0Ca5OeNG+zJEVD2Ql
+Rsg0fUtLrlNh4UR+gzvDb2sQ+bTLWLJ+xsFJj+XP7FjkiJiG1JpH4lwZPXiKgxPz
+qM/Xto0ufmpgjYlK+C8bGFjJ491/nBhGxqsT8IY6V0A0hssd1LXZUo9s95LA6Op5
+FQdaRkjNvOrhm6VnnP2aa1G/4fQ7Uxtu9da6rcVwYJxB7QotgIqI3K9GVy6A16UH
+vtK1kqis
+=4w25
+-END PGP SIGNATURE-

Added: release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512
==
--- release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512 (added)
+++ release/hudi/1.0.0-beta2/hudi-1.0.0-beta2.src.tgz.sha512 Sat Jul 13 
15:46:12 2024
@@ -0,0 +1 @@
+69e382e7415d2df60d66f9b2b9d30f310ae168d49c4e6f617188acd2e9246f66619b692d0d7b81c90407ac757b658719da303cf5d69c20289156ff64a9271271
  hudi-1.0.0-beta2.src.tgz




Re: [PR] [HUDI-7938] Broadcast `SerializableConfiguration` to avoid NullPointerException in Kryo SerDe [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11626:
URL: https://github.com/apache/hudi/pull/11626#issuecomment-2226825138

   
   ## CI report:
   
   * 256044ead7c3ab3a1c69f3fa46e36417965bb837 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24840)
 
   
   





Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11562:
URL: https://github.com/apache/hudi/pull/11562#issuecomment-2226813999

   
   ## CI report:
   
   * 086b1d836fd0af91e3cc2a41913bf3e92653bf78 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24841)
 
   
   





Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-13 Thread via GitHub


hudi-bot commented on PR #11562:
URL: https://github.com/apache/hudi/pull/11562#issuecomment-2226812409

   
   ## CI report:
   
   * 013aef32a3ad3aa995beb626f5855d9a05234cbf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24689)
 
   * 086b1d836fd0af91e3cc2a41913bf3e92653bf78 UNKNOWN
   
   





Re: [PR] [HUDI-7882][WIP] Adding RFC 78 for bridge release to assist users to migrate to 1.x from 0.x [hudi]

2024-07-13 Thread via GitHub


nsivabalan commented on code in PR #11514:
URL: https://github.com/apache/hudi/pull/11514#discussion_r1676778486


##
rfc/rfc-78/rfc-78.md:
##
@@ -0,0 +1,339 @@
+
+# RFC-78: [Bridge release for 1.x]
+
+## Proposers
+
+- @nsivabalan
+- @vbalaji
+
+## Approvers
+ - @yihua
+ - @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7882
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+[Hudi 
1.x](https://github.com/apache/hudi/blob/ae1ee05ab8c2bd732e57bee11c8748926b05ec4b/rfc/rfc-69/rfc-69.md)
 is a powerful 
+re-imagination of the transactional database layer in Hudi to power continued 
innovation across the community in the coming 
+years. It introduces a lot of differentiating features for Apache Hudi. Feel 
free to check out the 
+[release page](https://hudi.apache.org/releases/release-1.0.0-beta1) for more 
info. We had beta1 and beta2 releases, which were meant for 
+interested developers/users to try out some of the advanced features. But as 
we are working towards 1.0 GA, we are proposing 
+a bridge release (0.16.0) for a smoother migration for existing Hudi users. 
+
+## Objectives 
+The goal is to have a smooth migration experience for users going from 0.x to 1.0. We plan to have a 0.16.0 bridge release,
+asking everyone to first migrate to 0.16.0 before they can upgrade to 1.x.
+
+A typical organization might have a medallion architecture deployed to run
+1000s of Hudi pipelines, i.e. bronze, silver and gold layers.
+For this layout of pipelines, here is how a typical migration might look
+(without a bridge release):
+
+a. Existing pipelines are on 0.15.x (bronze, silver, gold).
+b. Migrate gold pipelines to 1.x.
+- We must strictly migrate only gold to 1.x first, because a 0.15.0 reader may not
+be able to read 1.x Hudi tables. So, if we migrate any of the silver pipelines to
+1.x before migrating the entire gold layer, we might end up in a situation
+where a 0.15.0 reader (gold) ends up reading a 1.x table (silver). This
+might lead to failures. So, we have to follow a certain order in which we migrate
+pipelines.
+c. Once all of gold is migrated to 1.x, we can move all of silver to 1.x.
+d. Once all of the gold and silver pipelines are migrated to 1.x, we can finally
+move all of bronze to 1.x.
+
+In the end, we would have migrated all of the existing Hudi pipelines from 0.15.0
+to 1.x.
+But as you can see, this migration requires coordination: pipelines must be migrated
+in a strict order. And in a very large organization, we may not have good control over
+downstream consumers.
+Hence, coordinating and orchestrating the entire migration workflow can
+be challenging.
+
+Hence, to ease the migration workflow for 1.x, we are introducing 0.16.0 as a
+bridge release.
+
+Here are the objectives with this bridge release:
+
+- A 1.x reader should be able to read 0.14.x to 0.16.x tables without any loss in
+functionality or data inconsistencies.
+- 0.16.x should be able to read 1.x tables, with some limitations. For
+features ported over from 0.x, no loss in functionality should be guaranteed.
+But for new features that were introduced in 1.x, we may not be able to support
+all of them. We will call out which new features may not work with a 0.16.x
+reader.
+- In that case, we explicitly request users not to turn on these features
+until all readers are completely migrated to 1.x, so as not to break any
+readers.
+
+Connecting back to our example above, let's see how the migration might look
+for an existing user.
+
+a. Existing pipelines are on 0.15.x (bronze, silver, gold).
+b. Migrate pipelines to 0.16.0 (in any order; we do not have any constraints
+around which pipeline should be migrated first).
+c. Ensure all pipelines are on 0.16.0 (both readers and writers).
+d. Start migrating pipelines in a rolling fashion to 1.x. At this juncture, we
+could have a few pipelines on 1.x and a few pipelines on 0.16.0, but since 0.16.x
+can read 1.x tables, we should be fine here. Just do not enable new
+features like non-blocking concurrency control yet.
+e. Migrate all of the 0.16.0 pipelines to 1.x.
+f. Once all readers and writers are on 1.x, we are good to enable any new
+features (like NBCC) with 1.x tables.
+
+As you can see, the bridge release relaxes the company/org-wide coordination needed to
+migrate gold before migrating silver or bronze. The only requirement to keep tabs on
+is to ensure all pipelines are completely migrated to 0.16.x before starting to
+migrate to 1.x.
+
+So, here are the objectives of this RFC with the bridge release:
+- A 1.x reader should be able to read 0.14.x to 0.16.x tables without any loss in
+functionality or data inconsistencies.
+- 0.16.x should be able to read 1.x tables, with some limitations. For
+features ported over from 0.x, no loss in functionality should be guaranteed.
+  But for new features being introduced in 1.x, we may not be able to
+support all of them. We will call out which new features may not work with a 0.16.x reader.
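
The reader/table ordering constraint described above can be sketched as a tiny compatibility predicate. This is an illustrative sketch only — the `canRead` method and version strings are hypothetical, not a Hudi API — encoding the matrix stated in this RFC: a 1.x reader reads everything, and among 0.x readers only the 0.16.x bridge release can read 1.x tables.

```java
public class ReadCompatibility {

    // Hypothetical encoding of the compatibility rules stated in this RFC:
    // - a 1.x reader can read 0.14.x through 1.x tables
    // - a 0.16.x reader can read 1.x tables (with limitations)
    // - a 0.15.x (or older) reader cannot read 1.x tables
    static boolean canRead(String readerVersion, String tableVersion) {
        if (readerVersion.startsWith("1.")) {
            return true; // 1.x readers handle both old and new formats
        }
        if (tableVersion.startsWith("1.")) {
            // only the 0.16.x bridge release can read 1.x tables
            return readerVersion.startsWith("0.16");
        }
        return true; // 0.x reader on a 0.x table
    }

    public static void main(String[] args) {
        // Gold (reader) consumes silver (table): upgrading silver to 1.x
        // while gold is still on 0.15.0 breaks the pipeline...
        System.out.println(canRead("0.15.0", "1.0.0")); // false
        // ...but after moving everything to 0.16.0 first,
        // tables can then be upgraded to 1.x in any order.
        System.out.println(canRead("0.16.0", "1.0.0")); // true
    }
}
```

This is why step (c) above — all readers and writers on 0.16.0 before any 1.x upgrade — removes the gold-before-silver ordering requirement.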

Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-13 Thread via GitHub


nsivabalan commented on code in PR #11562:
URL: https://github.com/apache/hudi/pull/11562#discussion_r1676778058


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java:
##
@@ -103,12 +106,30 @@ public void addBaseFile(HoodieBaseFile dataFile) {
* Add a new log file into the group.
*/
   public void addLogFile(HoodieLogFile logFile) {
-if (!fileSlices.containsKey(logFile.getBaseCommitTime())) {
-  fileSlices.put(logFile.getBaseCommitTime(), new FileSlice(fileGroupId, 
logFile.getBaseCommitTime()));
+String baseInstantTime = getBaseInstantTime(logFile);

Review Comment:
   Changes to accommodate file slice determination for both 0.x and 1.x log
files.
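
The pattern applied in the diff above — resolve the base instant time once per log file, then bucket the log file into the slice for that instant — can be sketched generically. The `LogFile` record and `groupIntoSlices` helper below are hypothetical stand-ins, not Hudi's actual `HoodieLogFile`/`FileSlice` classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class FileSliceGrouping {

    // Hypothetical stand-in for HoodieLogFile: in 0.x the base instant is
    // encoded in the file name; in 1.x it may have to be resolved differently.
    record LogFile(String name, String baseInstantTime) {}

    // Compute the base instant time once per log file, then group each log
    // file into the slice keyed by that instant (sorted by instant time).
    static TreeMap<String, List<LogFile>> groupIntoSlices(List<LogFile> logFiles) {
        TreeMap<String, List<LogFile>> fileSlices = new TreeMap<>();
        for (LogFile logFile : logFiles) {
            String baseInstantTime = logFile.baseInstantTime(); // resolution point
            fileSlices.computeIfAbsent(baseInstantTime, k -> new ArrayList<>())
                      .add(logFile);
        }
        return fileSlices;
    }

    public static void main(String[] args) {
        var slices = groupIntoSlices(List.of(
            new LogFile("log1", "001"),
            new LogFile("log2", "001"),
            new LogFile("log3", "002")));
        System.out.println(slices.keySet()); // two slices: [001, 002]
    }
}
```

The single resolution point is what lets one code path serve both 0.x and 1.x log files: only the way `baseInstantTime` is derived changes, not the grouping.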






Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-13 Thread via GitHub


nsivabalan commented on code in PR #11562:
URL: https://github.com/apache/hudi/pull/11562#discussion_r1676775310


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java:
##
@@ -38,18 +39,22 @@
  */
 public class HoodieInstant implements Serializable, Comparable {
 
-  // Instant like 20230104152218702.commit.request, 20230104152218702.inflight
+  // Instant like 20230104152218702.commit.request, 20230104152218702.inflight 
and 20230104152218702_20230104152630238.commit
   private static final Pattern NAME_FORMAT =
-  Pattern.compile("^(\\d+)(\\.\\w+)(\\.\\D+)?$");
+  Pattern.compile("^(\\d+(_\\d+)?)(\\.\\w+)(\\.\\D+)?$");
 
   private static final String DELIMITER = ".";
 
+  private static final String UNDERSCORE = "_";
+
   private static final String FILE_NAME_FORMAT_ERROR =
   "The provided file name %s does not conform to the required format";
 
+  private boolean completionTimeMissing = false;

Review Comment:
   NTR: 
   This will help deduce the completed commit file name. 
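
As a quick sanity check, the widened `NAME_FORMAT` regex from the diff above accepts both the 0.x single-timestamp names and the 1.x `<startTime>_<completionTime>` completed-file names. The `InstantNameFormat` wrapper class below is a standalone sketch; only the pattern itself comes from the diff:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InstantNameFormat {

    // Same pattern as the updated NAME_FORMAT in the diff above: the first
    // group now optionally carries an "_completionTime" suffix.
    static final Pattern NAME_FORMAT =
        Pattern.compile("^(\\d+(_\\d+)?)(\\.\\w+)(\\.\\D+)?$");

    // Returns the timestamp portion (group 1), or null if the name is invalid.
    static String timestampOf(String fileName) {
        Matcher m = NAME_FORMAT.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // 0.x-style names still match
        System.out.println(timestampOf("20230104152218702.commit.requested"));
        System.out.println(timestampOf("20230104152218702.inflight"));
        // 1.x completed file carrying "<startTime>_<completionTime>"
        System.out.println(timestampOf("20230104152218702_20230104152630238.commit"));
    }
}
```

When the `_completionTime` part is absent (as in all 0.x names), group 1 is just the start time — which is presumably where the `completionTimeMissing` flag comes in.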






Re: [PR] [DNM] Temp diff testing 1.x reads with 0.x branch [hudi]

2024-07-13 Thread via GitHub


nsivabalan commented on code in PR #11562:
URL: https://github.com/apache/hudi/pull/11562#discussion_r1676775154


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -554,21 +589,6 @@ public String toString() {
 return this.getClass().getName() + ": " + 
getInstantsAsStream().map(Object::toString).collect(Collectors.joining(","));
   }
 
-  /**
-   * Merge this timeline with the given timeline.
-   */
-  public HoodieDefaultTimeline mergeTimeline(HoodieDefaultTimeline timeline) {

Review Comment:
   moved below.


