Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10617:
URL: https://github.com/apache/hudi/pull/10617#issuecomment-1928930674

   
   ## CI report:
   
   * 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319)
 
   * 4b6f9845a2799ed74a6f0fef60519fc7c93371ea Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22337)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10617:
URL: https://github.com/apache/hudi/pull/10617#issuecomment-1928922485

   
   ## CI report:
   
   * 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319)
 
   * 4b6f9845a2799ed74a6f0fef60519fc7c93371ea UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [BUG] S3 Deltastreamer: Block has already been inflated [hudi]

2024-02-05 Thread via GitHub


chestnutqiang commented on issue #6428:
URL: https://github.com/apache/hudi/issues/6428#issuecomment-1928891833

   Same problem here: version 0.14.1, on HDFS.





Re: [PR] [HUDI-6902] Containerize the Azure CI 4th module [hudi]

2024-02-05 Thread via GitHub


linliu-code commented on code in PR #10512:
URL: https://github.com/apache/hudi/pull/10512#discussion_r1479262480


##
hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml:
##
@@ -92,6 +92,34 @@
 
 
 
+
+thrift-gen-source-with-script

Review Comment:
   In this change, we add a new profile that uses maven-thrift-plugin to compile 
thrift in the Azure CI container. In GH CI, we keep the original approach, since 
installing thrift there would take 10+ minutes. So there are two ways to compile 
thrift: Azure CI uses maven-thrift-plugin, while GH CI uses the original script. 
To separate them, I created two profiles. 






Re: [PR] [HUDI-7146] Support non-unique keys for secondary index [hudi]

2024-02-05 Thread via GitHub


codope commented on PR #10211:
URL: https://github.com/apache/hudi/pull/10211#issuecomment-1928815701

   Closing in favor of #10617 





Re: [PR] [HUDI-7146] Support non-unique keys for secondary index [hudi]

2024-02-05 Thread via GitHub


codope closed pull request #10211: [HUDI-7146] Support non-unique keys for 
secondary index
URL: https://github.com/apache/hudi/pull/10211





Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]

2024-02-05 Thread via GitHub


SudhirSaxena commented on issue #10587:
URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928800086

   Thanks @ad1happy2go, I will follow these steps and let you know.





Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]

2024-02-05 Thread via GitHub


ad1happy2go commented on issue #10587:
URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928794959

   Had a conversation with @SudhirSaxena on this and looked at his setup. He is 
using emr-6.15 with OSS Hudi 0.14.0.
   
   1. With RLI enabled, the upsert job gets stuck for hours with no progress, 
no useful logs, and no running stage in the Spark UI (driver logs are in the 
comment above).
   2. We tried with RLI disabled, keeping everything else the same, but saw 
similar behaviour. So RLI may not be the issue.
   
   Next steps (a repro sketch for the first step follows this list):
   - Create a test script that does a bulk insert and an insert from the 
quickstart, and see whether that works.
   - Try the same setup with version 0.14.1.
   - Try downgrading to an EMR version that ships Spark 3.3 and run Hudi 0.14.0 
to see whether the same behaviour appears.
   - Then downgrade the Hudi version to 0.12.3, which was used before, and 
confirm that it works fine.
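   
   A minimal repro sketch for the first step, in Java (assumptions: a DataFrame 
`df` shaped like the quickstart data with `uuid`, `ts`, and `partitionpath` 
columns, and a placeholder table name and `basePath`; the option keys are 
standard Hudi write configs):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class RliRepro {
     // Writes `df` with the given operation; run once with "bulk_insert" and
     // SaveMode.Overwrite, then again with "upsert" and SaveMode.Append to
     // mirror the quickstart flow. Table name and path are placeholders.
     static void write(Dataset<Row> df, String basePath, String operation, SaveMode mode) {
       df.write().format("hudi")
           .option("hoodie.table.name", "rli_repro")
           .option("hoodie.datasource.write.recordkey.field", "uuid")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
           .option("hoodie.metadata.enable", "true")              // RLI lives in the metadata table
           .option("hoodie.metadata.record.index.enable", "true") // enable record level index
           .option("hoodie.datasource.write.operation", operation)
           .mode(mode)
           .save(basePath);
     }
   }
   ```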
   





Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]

2024-02-05 Thread via GitHub


SudhirSaxena commented on issue #10587:
URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928793715

   @ad1happy2go as discussed for this issue, please find the driver logs, 
   24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 
KiB bytes)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_12999_piece0 
from memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_12999_piece0 to 
disk
   24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_12999_piece0 on 
disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original 
size: 0.0 B)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13000 from 
memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13000 to disk
   24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 
1342.0 KiB
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48457 stored as values 
in memory (estimated size 420.5 KiB, free 921.5 KiB)
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48457_piece0 stored as 
bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
   24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48457_piece0 in 
memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
   24/02/06 05:04:07 INFO SparkContext: Created broadcast 48457 from take at 
/mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
   24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 
KiB bytes)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13000_piece0 
from memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13000_piece0 to 
disk
   24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13000_piece0 on 
disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original 
size: 0.0 B)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13001 from 
memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13001 to disk
   24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 
1342.0 KiB
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48458 stored as values 
in memory (estimated size 420.5 KiB, free 921.5 KiB)
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48458_piece0 stored as 
bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
   24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48458_piece0 in 
memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
   24/02/06 05:04:07 INFO SparkContext: Created broadcast 48458 from take at 
/mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
   24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 
KiB bytes)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13001_piece0 
from memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13001_piece0 to 
disk
   24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13001_piece0 on 
disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original 
size: 0.0 B)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13002 from 
memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13002 to disk
   24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 
1342.0 KiB
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48459 stored as values 
in memory (estimated size 420.5 KiB, free 921.5 KiB)
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48459_piece0 stored as 
bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
   24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48459_piece0 in 
memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
   24/02/06 05:04:07 INFO SparkContext: Created broadcast 48459 from take at 
/mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
   24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 
KiB bytes)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13002_piece0 
from memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13002_piece0 to 
disk
   24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13002_piece0 on 
disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original 
size: 0.0 B)
   24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13003 from 
memory
   24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13003 to disk
   24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 
1342.0 KiB
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48460 stored as values 
in memory (estimated size 420.5 KiB, free 921.5 KiB)
   24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48460_piece0 stored as 
bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
   24/02/06 05:04:07 INFO 

Re: [PR] [HUDI-6902] Containerize the Azure CI 4th module [hudi]

2024-02-05 Thread via GitHub


codope commented on code in PR #10512:
URL: https://github.com/apache/hudi/pull/10512#discussion_r1479200835


##
hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml:
##
@@ -92,6 +92,34 @@
 
 
 
+
+thrift-gen-source-with-script

Review Comment:
   Just for my understanding, why do we need a separate profile? Shouldn't the 
hudi-platform-service profile suffice?






Re: [PR] [HUDI-9424]Support using local timezone when writing flink TIMESTAMP data [hudi]

2024-02-05 Thread via GitHub


danny0405 commented on PR #10594:
URL: https://github.com/apache/hudi/pull/10594#issuecomment-1928733456

   The Travis tests still got failures.





[jira] [Closed] (HUDI-7338) Bump HBase, pulsar-client, and jetty version

2024-02-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7338.

Fix Version/s: 1.0.0
   Resolution: Fixed

Fixed via master branch: c1d47014ca0430b2e2f4c2225767f2754a4fab2c

> Bump HBase, pulsar-client, and jetty version
> 
>
> Key: HUDI-7338
> URL: https://issues.apache.org/jira/browse/HUDI-7338
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Shawn Chang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> There is a major CVE spotted in jetty/netty: 
> [https://nvd.nist.gov/vuln/detail/CVE-2023-44487]
>  
> Bumping the version can help mitigate the problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)

2024-02-05 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c1d47014ca0 [HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)
c1d47014ca0 is described below

commit c1d47014ca0430b2e2f4c2225767f2754a4fab2c
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Mon Feb 5 19:43:50 2024 -0800

[HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)

Co-authored-by: Shawn Chang 
---
 pom.xml | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/pom.xml b/pom.xml
index d9a87558939..3eeed340178 100644
--- a/pom.xml
+++ b/pom.xml
@@ -102,7 +102,7 @@
 
${fasterxml.spark3.version}
 2.0.0
 2.8.0
-    <pulsar.version>2.10.2</pulsar.version>
+    <pulsar.version>3.0.2</pulsar.version>
 
${pulsar.spark.scala12.version}
 2.4.5
 3.1.1.4
@@ -189,9 +189,9 @@
 log4j2-surefire.properties
 0.13.0
 4.6.7
-    <jetty.version>9.4.48.v20220622</jetty.version>
+    <jetty.version>9.4.53.v20231009</jetty.version>
 3.1.0-incubating
-    <hbase.version>2.4.9</hbase.version>
+    <hbase.version>2.4.13</hbase.version>
 1.4.199
 3.1.2
 false
@@ -476,6 +476,7 @@
   <include>org.apache.hbase.thirdparty:hbase-shaded-miscellaneous</include>
   <include>org.apache.hbase.thirdparty:hbase-shaded-netty</include>
   <include>org.apache.hbase.thirdparty:hbase-shaded-protobuf</include>
+  <include>org.apache.hbase.thirdparty:hbase-unsafe</include>
   <include>org.apache.htrace:htrace-core4</include>
   
   
com.fasterxml.jackson.module:jackson-module-afterburner



Re: [PR] [HUDI-7338] Upgrade Jetty, HBase, and pulsar-client [hudi]

2024-02-05 Thread via GitHub


danny0405 merged PR #10223:
URL: https://github.com/apache/hudi/pull/10223





Re: [I] [SUPPORT] Using MOR table and synchronizing to Hive, Flink checkpoint failed, resulting in log files being unable to roll over to parquet files [hudi]

2024-02-05 Thread via GitHub


danny0405 commented on issue #10616:
URL: https://github.com/apache/hudi/issues/10616#issuecomment-1928729633

   Are you using append mode or upsert mode?





Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928656987

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * 06c2064ab7a3087ae57f345253dd8ed0a9615c02 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22334)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch master updated: [HUDI-7366] Fix HoodieLocation with encoded paths (#10602)

2024-02-05 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 590506752c1 [HUDI-7366] Fix HoodieLocation with encoded paths (#10602)
590506752c1 is described below

commit 590506752c1034183906526c4c414e7500953f1b
Author: Y Ethan Guo 
AuthorDate: Mon Feb 5 17:31:35 2024 -0800

[HUDI-7366] Fix HoodieLocation with encoded paths (#10602)
---
 .../main/java/org/apache/hudi/storage/HoodieLocation.java|  3 ++-
 .../java/org/apache/hudi/io/storage/TestHoodieLocation.java  | 12 ++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java 
b/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
index 3b3a05dc9b4..2073548b7d1 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
@@ -108,7 +108,8 @@ public class HoodieLocation implements Comparable<HoodieLocation>, Serializable
   parentUri.getAuthority(),
   parentPathWithSeparator,
   null,
-  parentUri.getFragment()).resolve(normalizedChild);
+  parentUri.getFragment())
+  .resolve(new URI(null, null, normalizedChild, null, null));
   this.uri = new URI(
   parentUri.getScheme(),
   parentUri.getAuthority(),
diff --git 
a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java 
b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
index 4c765d2cc3f..7c3af8741ba 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
@@ -115,6 +115,18 @@ public class TestHoodieLocation {
 new HoodieLocation(new HoodieLocation(new URI("foo://bar/baz#bud")), 
"/fud#boo").toString());
   }
 
+  @Test
+  public void testEncoded() {
+// encoded character like `%2F` should be kept as is
+assertEquals(new HoodieLocation("s3://foo/bar/1%2F2%2F3"), new 
HoodieLocation("s3://foo/bar", "1%2F2%2F3"));
+assertEquals("s3://foo/bar/1%2F2%2F3", new HoodieLocation("s3://foo/bar", 
"1%2F2%2F3").toString());
+assertEquals(new HoodieLocation("s3://foo/bar/1%2F2%2F3"),
+new HoodieLocation(new HoodieLocation("s3://foo/bar"), "1%2F2%2F3"));
+assertEquals("s3://foo/bar/1%2F2%2F3",
+new HoodieLocation(new HoodieLocation("s3://foo/bar"), 
"1%2F2%2F3").toString());
+assertEquals("s3://foo/bar/1%2F2%2F3", new 
HoodieLocation("s3://foo/bar/1%2F2%2F3").toString());
+  }
+
   @Test
   public void testPathToUriConversion() throws URISyntaxException {
 assertEquals(new URI(null, null, "/foo?bar", null, null),
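
A note on the behavior the new `testEncoded` pins down: `java.net.URI` keeps
percent-escapes in its raw form, but component accessors such as `getPath()`
decode them, so rebuilding a URI from decoded components corrupts `%2F`. A
minimal JDK-only illustration (not the patched Hudi code path):

```java
import java.net.URI;

public class EncodedPathDemo {
  public static void main(String[] args) {
    URI base = URI.create("s3://foo/bar/");
    // resolve() keeps the escaped form in the raw URI...
    URI resolved = base.resolve("1%2F2%2F3");
    System.out.println(resolved);            // s3://foo/bar/1%2F2%2F3
    // ...but getPath() decodes escapes, turning one segment into three:
    System.out.println(resolved.getPath());  // /bar/1/2/3
    // Rebuilding from decoded components silently loses the encoding, which
    // is why the patch resolves against a URI built with the multi-argument
    // constructor instead of a raw child string.
    URI rebuilt = URI.create(resolved.getScheme() + "://" + resolved.getAuthority() + resolved.getPath());
    System.out.println(rebuilt);             // s3://foo/bar/1/2/3 -- %2F is gone
  }
}
```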



Re: [PR] [HUDI-7366] Fix HoodieLocation with encoded paths [hudi]

2024-02-05 Thread via GitHub


vinothchandar merged PR #10602:
URL: https://github.com/apache/hudi/pull/10602





Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]

2024-02-05 Thread via GitHub


vinothchandar commented on code in PR #10586:
URL: https://github.com/apache/hudi/pull/10586#discussion_r1479114527


##
hudi-io/src/main/java/org/apache/hudi/storage/StorageConfiguration.java:
##
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+/**
+ * Interface providing the storage configuration in type {@link T}.
+ *
+ * @param <T> type of storage configuration to provide.
+ */
+public abstract class StorageConfiguration<T> implements Serializable {
+  /**
+   * @return the storage configuration.
+   */
+  public abstract T get();
+
+  /**
+   * @return a new copy of the storage configuration.
+   */
+  public abstract T newCopy();
+
+  /**
+   * Serializes the storage configuration.
+   * DO NOT change the signature, as required by {@link Serializable}.
+   *
+   * @param out stream to write.
+   * @throws IOException on I/O error.
+   */
+  public abstract void writeObject(ObjectOutputStream out) throws IOException;

Review Comment:
   This needs to be `ObjectOutputStream` due to `Serializable`, right? We 
should have the ability to control the binary serialization of this object; 
let's make sure of that.
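
For reference, Java serialization discovers these hooks reflectively, and only
with the exact private signature below; a minimal sketch (a hypothetical
properties-backed configuration, not the PR's class) of taking control of the
binary form:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch: the wrapped config object is transient (often not Serializable
// itself, e.g. Hadoop's Configuration), so the wire format is written by hand.
class PropsBackedConfiguration implements Serializable {
  private transient Properties props = new Properties();

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeInt(props.size());                 // explicit, versionable layout
    for (String name : props.stringPropertyNames()) {
      out.writeUTF(name);
      out.writeUTF(props.getProperty(name));
    }
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    props = new Properties();                   // rebuild the transient field
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      props.setProperty(in.readUTF(), in.readUTF());
    }
  }
}
```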



##
hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage.hadoop;
+
+import org.apache.hudi.io.storage.TestStorageConfigurationBase;
+import org.apache.hudi.storage.StorageConfiguration;
+
+import org.apache.hadoop.conf.Configuration;
+
+import java.util.Map;
+
+/**
+ * Tests {@link HadoopStorageConfiguration}.
+ */
+public class TestHadoopStorageConfiguration extends TestStorageConfigurationBase<Configuration> {

Review Comment:
   again this is a test class. not a test by itself. fix naming?



##
hudi-io/src/main/java/org/apache/hudi/storage/StorageConfiguration.java:
##
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+/**
+ * Interface providing the storage configuration in type {@link T}.
+ *
+ * @param <T> type of storage configuration to provide.
+ */
+public abstract 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


bhasudha commented on PR #10624:
URL: https://github.com/apache/hudi/pull/10624#issuecomment-1928592401

   Tested it locally; the diagrams may need to be reduced in size since they 
feel a little disproportionate compared to other pages. 





Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928546567

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
 
   * 06c2064ab7a3087ae57f345253dd8ed0a9615c02 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22334)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928492419

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
 
   * 06c2064ab7a3087ae57f345253dd8ed0a9615c02 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928476735

   
   ## CI report:
   
   * e39968e5155283e2c25a31626732a1cdde634840 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch master updated (ff0e67f78df -> c098ebaf166)

2024-02-05 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from ff0e67f78df [HUDI-7351] Implement partition pushdown for glue (#10604)
 add c098ebaf166 [HUDI-7375] Disable a flaky test method (#10627)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/common/functional/TestHoodieLogFormat.java | 2 ++
 1 file changed, 2 insertions(+)



Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]

2024-02-05 Thread via GitHub


yihua merged PR #10627:
URL: https://github.com/apache/hudi/pull/10627





Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928111329

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928100117

   
   ## CI report:
   
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
 
   * e39968e5155283e2c25a31626732a1cdde634840 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10627:
URL: https://github.com/apache/hudi/pull/10627#issuecomment-1928100175

   
   ## CI report:
   
   * f86247ccd72b443975c8ab08b74300627641c5c8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22331)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-05 Thread via GitHub


Shubham21k opened a new issue, #10628:
URL: https://github.com/apache/hudi/issues/10628

   
   We are incrementally writing to a Hudi table with insert_overwrite 
operations. Recently, we enabled the Hudi metadata table for these tables. 
However, after a few days we started to encounter a `FileNotFoundException` 
while reading these tables from Athena (with metadata listing enabled). Upon 
further investigation, we observed that the metadata still lists older files 
that were cleaned up by the cleaner and are no longer available.
   
   
   
   Steps to reproduce the behavior:
   1. Create a simple DataFrame and write it to a Hudi table incrementally with 
these properties:
   ```
   hoodie.datasource.meta.sync.enable=true 
   
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 
   hoodie.write.markers.type=DIRECT 
   hoodie.metadata.enable=true 
   hoodie.datasource.write.operation=insert_overwrite 
   hoodie.datasource.write.partitionpath.field=cs_load_hr 
   hoodie.datasource.hive_sync.partition_fields=cs_load_hr 
   
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor 
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
 
   hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING 
   hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd/HH 
   
hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
 
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
 
   hoodie.parquet.compression.codec=snappy 
   hoodie.table.services.enabled=true 
   hoodie.rollback.using.markers=false 
   hoodie.commits.archival.batch=30 
   hoodie.archive.delete.parallelism=500 
   hoodie.index.type=SIMPLE 
   hoodie.clean.allow.multiple=false 
   hoodie.clean.async=true 
   hoodie.clean.automatic=true 
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS 
   hoodie.cleaner.commits.retained=3 
   hoodie.cleaner.parallelism=500 
   hoodie.cleaner.incremental.mode=true 
   hoodie.clean.max.commits=8 
   hoodie.archive.async=true 
   hoodie.archive.automatic=true 
   hoodie.archive.merge.enable=true 
   hoodie.archive.merge.files.batch.size=60 
   hoodie.keep.max.commits=10 
   hoodie.keep.min.commits=5
   ```
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)
   
   2. After a few incremental writes, some of the base files get replaced, but 
the metadata table does not get updated properly; it continues to list the old 
file pointers as well.
   3. If you try reading the table using Spark or Athena with metadata listing 
enabled, you will get a FileNotFoundException (see the read sketch after this 
list). Upon disabling metadata listing on the read side, there is no error and 
reads work fine.
   4. Note: we have observed this issue only for **insert_overwrite** 
operations. For upsert operations, the table's metadata gets updated correctly.
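   
   A minimal read sketch for step 3, in Java (assuming a SparkSession `spark` 
and the table's base path; `hoodie.metadata.enable` is the standard read-side 
switch for metadata-based file listing):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class MetadataReadCheck {
     static Dataset<Row> read(SparkSession spark, String basePath, boolean useMetadata) {
       return spark.read().format("hudi")
           // true reproduces the FileNotFoundException on the affected
           // tables; false lists files from storage and reads fine
           .option("hoodie.metadata.enable", String.valueOf(useMetadata))
           .load(basePath);
     }
   }
   ```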
   
   **Expected behavior**
   
   It is expected that the hoodie metadata gets updated correctly.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.2.1
   
   * Hive version : NA
   
   * Hadoop version : 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   The timeline of the corrupted tables also contains replacecommit actions, 
which are not present in the case of the upsert tables. 
   
   ```
   $ aws s3 ls s3://tmp-data/investments_ctr_tbl/.hoodie/
  PRE .aux/
  PRE archived/
  PRE metadata/
   2023-12-08 13:32:17  0 .aux_$folder$
   2023-12-08 13:32:17  0 .schema_$folder$
   2023-12-08 13:32:17  0 .temp_$folder$
   2023-12-14 22:17:18   4678 20231214221641350.clean
   2023-12-14 22:17:11   3227 20231214221641350.clean.inflight
   2023-12-14 22:17:10   3227 20231214221641350.clean.requested
   2023-12-22 21:50:54   4439 2023114849300.clean
   2023-12-22 21:50:45   4337 2023114849300.clean.inflight
   2023-12-22 21:50:45   4337 2023114849300.clean.requested
   2023-12-30 21:51:16   4439 20231230214431936.clean
   2023-12-30 21:51:07   4337 20231230214431936.clean.inflight
   2023-12-30 21:51:07   4337 20231230214431936.clean.requested
   2024-01-07 21:53:30   4439 20240107215204594.clean
   2024-01-07 21:53:23   4337 20240107215204594.clean.inflight
   2024-01-07 21:53:22   4337 20240107215204594.clean.requested
   2024-01-15 21:55:00   4439 20240115215112126.clean
   2024-01-15 21:54:52   4337 20240115215112126.clean.inflight
   2024-01-15 21:54:52   4337 20240115215112126.clean.requested
   2024-01-23 21:46:53   4439 20240123214442067.clean
   2024-01-23 21:46:45   4337 20240123214442067.clean.inflight
   2024-01-23 21:46:45   4337 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478878886


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementation to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.

Review Comment:
   @dipankarmazumdar can you also fix all occurrences of File Group, File 
Slice, Base File, Log File, etc. to align on the casing, indicating these are 
Hudi-specific terms?






Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478877408


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementation to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that 
are responsible for the fundamental operations and services that enable Hudi to 
store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), 
stored in the /.hoodie folder, is a crucial event log recording all table 
actions in an ordered manner, with events kept for a specified period. Hudi 
uniquely designs each file group as a self-contained log, enabling record state 
reconstruction through delta logs, even after archival of related actions. This 
approach effectively limits metadata size based on table activity frequency, 
essential for managing tables with frequent updates. 
+
+- **File Group and File Slice** : Within each partition the data is physically 
stored as base and log files and organized into logical concepts as [File 
groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File 
slices. File groups contain multiple versions of file slices and are split into 
multiple file slices. A file slice comprises the base and log file. Each file 
slice within the file-group is uniquely identified by the commit's timestamp 
that created it.
+
+- **Metadata Table** 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478876581


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementation to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that 
are responsible for the fundamental operations and services that enable Hudi to 
store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), 
stored in the /.hoodie folder, is a crucial event log recording all table 
actions in an ordered manner, with events kept for a specified period. Hudi 
uniquely designs each file group as a self-contained log, enabling record state 
reconstruction through delta logs, even after archival of related actions. This 
approach effectively limits metadata size based on table activity frequency, 
essential for managing tables with frequent updates. 
+
+- **File Group and File Slice** : Within each partition the data is physically 
stored as base and log files and organized into logical concepts as [File 
groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File 
slices. File groups contain multiple versions of file slices and are split into 
multiple file slices. A file slice comprises the base and log file. Each file 
slice within the file-group is uniquely identified by the commit's timestamp 
that created it.
+
+- **Metadata Table** 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478876581


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementation to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that 
are responsible for the fundamental operations and services that enable Hudi to 
store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), 
stored in the /.hoodie folder, is a crucial event log recording all table 
actions in an ordered manner, with events kept for a specified period. Hudi 
uniquely designs each file group as a self-contained log, enabling record state 
reconstruction through delta logs, even after archival of related actions. This 
approach effectively limits metadata size based on table activity frequency, 
essential for managing tables with frequent updates. 
+
+- **File Group and File Slice** : Within each partition, the data is physically stored as base and log files and organized into the logical concepts of [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group contains multiple file slices, and each file slice comprises a base file and its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478873515


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake, thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementations to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that 
are responsible for the fundamental operations and services that enable Hudi to 
store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), 
stored in the /.hoodie folder, is a crucial event log recording all table 
actions in an ordered manner, with events kept for a specified period. Hudi 
uniquely designs each file group as a self-contained log, enabling record state 
reconstruction through delta logs, even after archival of related actions. This 
approach effectively limits metadata size based on table activity frequency, 
essential for managing tables with frequent updates. 
+
+- **File Group and File Slice** : Within each partition, the data is physically stored as base and log files and organized into the logical concepts of [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group contains multiple file slices, and each file slice comprises a base file and its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 

Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928036693

   
   ## CI report:
   
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
 
   * e39968e5155283e2c25a31626732a1cdde634840 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478869049


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake, thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementations to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that 
are responsible for the fundamental operations and services that enable Hudi to 
store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), 
stored in the /.hoodie folder, is a crucial event log recording all table 
actions in an ordered manner, with events kept for a specified period. Hudi 
uniquely designs each file group as a self-contained log, enabling record state 
reconstruction through delta logs, even after archival of related actions. This 
approach effectively limits metadata size based on table activity frequency, 
essential for managing tables with frequent updates. 
+
+- **File Group and File Slice** : Within each partition, the data is physically stored as base and log files and organized into the logical concepts of [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group contains multiple file slices, and each file slice comprises a base file and its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 

Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928026486

   
   ## CI report:
   
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
 
   * e39968e5155283e2c25a31626732a1cdde634840 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478861210


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make 
up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database 
kernel. It brings core warehouse and database functionality directly to a data 
lake, thereby providing a table-level abstraction over open file formats like 
Apache Parquet/ORC (more recently known as the lakehouse architecture) and 
enabling transactional capabilities such as updates/deletes. Hudi also 
incorporates essential table services that are tightly integrated with the 
database kernel. These services can be executed automatically across both 
ingested and derived data to manage various aspects such as table bookkeeping, 
metadata, and storage layout. This integration along with various 
platform-specific services extends Hudi's role from being just a 'table format' 
to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi 
interacts with the storage layer through the [Hadoop FileSystem 
API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html),
 enabling compatibility with various systems including HDFS for fast appends, 
and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and 
Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can 
rely on Hadoop-independent file system implementations to simplify the 
integration of various file systems. Hudi adds a custom wrapper filesystem that 
lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. 
Hudi operates on a 'base file and log file' structure. The base files are 
compacted and optimized for reads and are augmented with log files for 
efficient append. Future updates aim to integrate diverse formats like 
unstructured data (e.g., JSON, images), and compatibility with different 
storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout 
scheme encodes all changes to a log file as a sequence of blocks (data, delete, 
rollback). By making data available in open file formats (such as Parquet), 
Hudi enables users to bring any compute engine for specific workloads.

Review Comment:
   ```suggestion
   File formats hold the raw data and are physically stored on the lake 
storage. Hudi operates on logical structures of File Groups and File Slices, 
which consist of Base File and Log Files. Base Files are compacted and 
optimized for reads and are augmented with Log Files for efficient append. 
Future updates aim to integrate diverse formats like unstructured data (e.g., 
images), and compatibility with different storage layers in event-streaming, 
OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a Log 
File as a sequence of blocks (data, delete, rollback). By making data available 
in open file formats (such as Parquet), Hudi enables users to bring any compute 
engine for specific workloads.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927870267

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 24839296069f8b228f31e7000c77a4630913dc07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318)
 
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception

2024-02-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-7360:
-
Priority: Blocker  (was: Critical)

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class 
> incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query, reader-core
>Reporter: Aditya Goenka
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
> Reproducible code:
> ```python
> from typing import Any
> from pyspark import Row
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> spark = SparkSession.builder \
>     .appName("Hudi Basics") \
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
>     .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
>     .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
>     .getOrCreate()
> sc = spark.sparkContext
>
> table_name = "hudi_trips_cdc"
> base_path = "/tmp/test_issue_10590_4"  # Replace for whatever path
> quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
> dataGen = quickstart_utils.DataGenerator()
> inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
>
> def create_df():
>     df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
>     return df
>
> def write_data():
>     df = create_df()
>     hudi_options = {
>         "hoodie.table.name": table_name,
>         "hoodie.datasource.write.recordkey.field": "uuid",
>         "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # This can be either MoR or CoW and the error will still happen
>         "hoodie.datasource.write.partitionpath.field": "partitionpath",
>         "hoodie.datasource.write.table.name": table_name,
>         "hoodie.datasource.write.operation": "upsert",
>         "hoodie.table.cdc.enabled": "true",  # This can be left enabled, and won't affect anything unless actually queried as CDC
>         "hoodie.datasource.write.precombine.field": "ts",
>         "hoodie.upsert.shuffle.parallelism": 2,
>         "hoodie.insert.shuffle.parallelism": 2
>     }
>     df.write.format("hudi") \
>         .options(**hudi_options) \
>         .mode("overwrite") \
>         .save(base_path)
>
> def update_data():
>     updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
>     df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
>     df.write \
>         .format("hudi") \
>         .mode("append") \
>         .save(base_path)
>
> def incremental_query():
>     ordered_rows: list[Row] = spark.read \
>         .format("hudi") \
>         .load(base_path) \
>         .select(col("_hoodie_commit_time").alias("commit_time")) \
>         .orderBy(col("commit_time")) \
>         .collect()
>     commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
>     begin_time = commits[0]
>     incremental_read_options = {
>         'hoodie.datasource.query.incremental.format': "cdc",  # Uncomment this line to query as CDC, crashes in 0.14.1
>         'hoodie.datasource.query.type': 'incremental',
>         'hoodie.datasource.read.begin.instanttime': begin_time,
>     }
>     trips_incremental_df = spark.read \
>         .format("hudi") \
>         .options(**incremental_read_options) \
>         .load(base_path)
>     # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
>     # sql_query = f"""SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
>     # trips_incremental_df = spark.sql(sql_query)
>     trips_incremental_df.show()
>     trips_incremental_df.printSchema()
>
> if __name__ == "__main__":
>     write_data()
>     update_data()
>     incremental_query()
> ```
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception

2024-02-05 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-7360:
-
Component/s: incremental-query

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class 
> incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: incremental-query, reader-core
>Reporter: Aditya Goenka
>Priority: Critical
> Fix For: 1.1.0
>
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
> Reproducible code:
> ```python
> from typing import Any
> from pyspark import Row
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> spark = SparkSession.builder \
>     .appName("Hudi Basics") \
>     .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>     .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
>     .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
>     .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
>     .getOrCreate()
> sc = spark.sparkContext
>
> table_name = "hudi_trips_cdc"
> base_path = "/tmp/test_issue_10590_4"  # Replace for whatever path
> quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
> dataGen = quickstart_utils.DataGenerator()
> inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
>
> def create_df():
>     df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
>     return df
>
> def write_data():
>     df = create_df()
>     hudi_options = {
>         "hoodie.table.name": table_name,
>         "hoodie.datasource.write.recordkey.field": "uuid",
>         "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # This can be either MoR or CoW and the error will still happen
>         "hoodie.datasource.write.partitionpath.field": "partitionpath",
>         "hoodie.datasource.write.table.name": table_name,
>         "hoodie.datasource.write.operation": "upsert",
>         "hoodie.table.cdc.enabled": "true",  # This can be left enabled, and won't affect anything unless actually queried as CDC
>         "hoodie.datasource.write.precombine.field": "ts",
>         "hoodie.upsert.shuffle.parallelism": 2,
>         "hoodie.insert.shuffle.parallelism": 2
>     }
>     df.write.format("hudi") \
>         .options(**hudi_options) \
>         .mode("overwrite") \
>         .save(base_path)
>
> def update_data():
>     updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
>     df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
>     df.write \
>         .format("hudi") \
>         .mode("append") \
>         .save(base_path)
>
> def incremental_query():
>     ordered_rows: list[Row] = spark.read \
>         .format("hudi") \
>         .load(base_path) \
>         .select(col("_hoodie_commit_time").alias("commit_time")) \
>         .orderBy(col("commit_time")) \
>         .collect()
>     commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
>     begin_time = commits[0]
>     incremental_read_options = {
>         'hoodie.datasource.query.incremental.format': "cdc",  # Uncomment this line to query as CDC, crashes in 0.14.1
>         'hoodie.datasource.query.type': 'incremental',
>         'hoodie.datasource.read.begin.instanttime': begin_time,
>     }
>     trips_incremental_df = spark.read \
>         .format("hudi") \
>         .options(**incremental_read_options) \
>         .load(base_path)
>     # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
>     # sql_query = f"""SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
>     # trips_incremental_df = spark.sql(sql_query)
>     trips_incremental_df.show()
>     trips_incremental_df.printSchema()
>
> if __name__ == "__main__":
>     write_data()
>     update_data()
>     incremental_query()
> ```
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10627:
URL: https://github.com/apache/hudi/pull/10627#issuecomment-1927859226

   
   ## CI report:
   
   * f86247ccd72b443975c8ab08b74300627641c5c8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22331)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927859155

   
   ## CI report:
   
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927858601

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 24839296069f8b228f31e7000c77a4630913dc07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318)
 
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7375] Disable a flaky test case [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10627:
URL: https://github.com/apache/hudi/pull/10627#issuecomment-1927736445

   
   ## CI report:
   
   * f86247ccd72b443975c8ab08b74300627641c5c8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927736186

   
   ## CI report:
   
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927701138

   
   ## CI report:
   
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927700598

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 24839296069f8b228f31e7000c77a4630913dc07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318)
 
   * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7375) Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7375:
-
Labels: pull-request-available  (was: )

> Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks
> 
>
> Key: HUDI-7375
> URL: https://issues.apache.org/jira/browse/HUDI-7375
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> Error: testLogReaderWithDifferentVersionsOfDeleteBlocks{DiskMapType, boolean, boolean, boolean}[13]  Time elapsed: 0.043 s  <<< ERROR!
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/root/[13] BITCASK, false, true, false1706913234251/partition_path/.test-fileid1_100.log.1_1-0-1 could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2338)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2989)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:911)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
>
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1558)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1455)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
>   at jdk.proxy2/jdk.proxy2.$Proxy43.addBlock(Unknown Source)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:530)
>   at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at jdk.proxy2/jdk.proxy2.$Proxy44.addBlock(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
>   at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1915)
>   at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1717)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7375] Disable a flaky test method [hudi]

2024-02-05 Thread via GitHub


linliu-code opened a new pull request, #10627:
URL: https://github.com/apache/hudi/pull/10627

   
   
   ### Change Logs
   
   The failure is caused by issues in the underlying MiniHDFS. We should aim to fix the root cause; for now, we disable the flaky method (a sketch of the disabling pattern follows).
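   
   For illustration, a minimal sketch of the disabling pattern with JUnit 5, assuming a parameterized test shaped like the one in HUDI-7375; the class and parameter names here are hypothetical placeholders, not the actual Hudi test code:
   
   ```java
   import java.util.stream.Stream;
   
   import org.junit.jupiter.api.Disabled;
   import org.junit.jupiter.params.ParameterizedTest;
   import org.junit.jupiter.params.provider.Arguments;
   import org.junit.jupiter.params.provider.MethodSource;
   
   // Hypothetical stand-in for the real parameterized test class.
   public class TestLogReaderDisabledExample {
   
     static Stream<Arguments> testArguments() {
       return Stream.of(Arguments.of(true), Arguments.of(false));
     }
   
     // @Disabled makes JUnit 5 report the method as skipped instead of running it.
     @Disabled("HUDI-7375: flaky on MiniHDFS; re-enable once the root cause is fixed")
     @ParameterizedTest
     @MethodSource("testArguments")
     void testLogReaderWithDifferentVersionsOfDeleteBlocks(boolean useOptimizedLogBlocksScan) {
       // Real test body unchanged; placeholder here.
     }
   }
   ```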
   
   ### Impact
   
   Unblock CI tests.
   
   ### Risk level (write none, low medium or high below)
   
   Low.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927681286

   
   ## CI report:
   
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927531831

   
   ## CI report:
   
   * 264059fcce703e1bde6c07bdce6ee106fcff30a6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22325)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]

2024-02-05 Thread via GitHub


yihua commented on code in PR #10619:
URL: https://github.com/apache/hudi/pull/10619#discussion_r1478603403


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java:
##
@@ -239,18 +240,25 @@ public List compact(HoodieCompactionHandler 
compactionHandler,
 scanner.close();
 Iterable> resultIterable = () -> result;
 return StreamSupport.stream(resultIterable.spliterator(), 
false).flatMap(Collection::stream).peek(s -> {
-  
s.getStat().setTotalUpdatedRecordsCompacted(scanner.getNumMergedRecordsInLog());
-  s.getStat().setTotalLogFilesCompacted(scanner.getTotalLogFiles());
-  s.getStat().setTotalLogRecords(scanner.getTotalLogRecords());
-  s.getStat().setPartitionPath(operation.getPartitionPath());
-  s.getStat()
+  final HoodieWriteStat stat = s.getStat();
+  stat.setTotalUpdatedRecordsCompacted(scanner.getNumMergedRecordsInLog());
+  stat.setTotalLogFilesCompacted(scanner.getTotalLogFiles());
+  stat.setTotalLogRecords(scanner.getTotalLogRecords());
+  stat.setPartitionPath(operation.getPartitionPath());
+  stat
   
.setTotalLogSizeCompacted(operation.getMetrics().get(CompactionStrategy.TOTAL_LOG_FILE_SIZE).longValue());
-  s.getStat().setTotalLogBlocks(scanner.getTotalLogBlocks());
-  s.getStat().setTotalCorruptLogBlock(scanner.getTotalCorruptBlocks());
-  s.getStat().setTotalRollbackBlocks(scanner.getTotalRollbacks());
+  stat.setTotalLogBlocks(scanner.getTotalLogBlocks());
+  stat.setTotalCorruptLogBlock(scanner.getTotalCorruptBlocks());
+  stat.setTotalRollbackBlocks(scanner.getTotalRollbacks());
   RuntimeStats runtimeStats = new RuntimeStats();
+  // scan time has to be obtained from scanner.
   
runtimeStats.setTotalScanTime(scanner.getTotalTimeTakenToReadAndMergeBlocks());
-  s.getStat().setRuntimeStats(runtimeStats);
+  // create and upsert time are obtained from the create or merge handle.
+  if (stat.getRuntimeStats() != null) {
+
runtimeStats.setTotalCreateTime(stat.getRuntimeStats().getTotalCreateTime());
+
runtimeStats.setTotalUpsertTime(stat.getRuntimeStats().getTotalUpsertTime());

Review Comment:
   Can we add a unit test around the runtime stats?
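   
   A minimal sketch of such a test, exercising just the copy logic shown in the diff above and assuming the `HoodieWriteStat.RuntimeStats` accessors used there:
   
   ```java
   import org.apache.hudi.common.model.HoodieWriteStat;
   import org.apache.hudi.common.model.HoodieWriteStat.RuntimeStats;
   import org.junit.jupiter.api.Test;
   
   import static org.junit.jupiter.api.Assertions.assertEquals;
   
   public class TestCompactionRuntimeStats {
   
     @Test
     void createAndUpsertTimesSurviveStatRewrite() {
       // Simulate a write stat produced by a create/merge handle.
       HoodieWriteStat stat = new HoodieWriteStat();
       RuntimeStats handleStats = new RuntimeStats();
       handleStats.setTotalCreateTime(123L);
       handleStats.setTotalUpsertTime(456L);
       stat.setRuntimeStats(handleStats);
   
       // Mirror the peek() logic: build fresh RuntimeStats with the scan time,
       // carrying over create/upsert time from the handle's stats.
       RuntimeStats runtimeStats = new RuntimeStats();
       runtimeStats.setTotalScanTime(789L);
       if (stat.getRuntimeStats() != null) {
         runtimeStats.setTotalCreateTime(stat.getRuntimeStats().getTotalCreateTime());
         runtimeStats.setTotalUpsertTime(stat.getRuntimeStats().getTotalUpsertTime());
       }
       stat.setRuntimeStats(runtimeStats);
   
       assertEquals(123L, stat.getRuntimeStats().getTotalCreateTime());
       assertEquals(456L, stat.getRuntimeStats().getTotalUpsertTime());
       assertEquals(789L, stat.getRuntimeStats().getTotalScanTime());
     }
   }
   ```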



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927504960

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
 
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Inconsistency in Hudi Table Configuration between Initial Insert and Subsequent Merges [hudi]

2024-02-05 Thread via GitHub


prashant462 opened a new issue, #10626:
URL: https://github.com/apache/hudi/issues/10626

   ### Issue Summary
   
   When using dbt Spark with Hudi to create a Hudi format table, there is an 
inconsistency in the Hudi table configuration between the initial insert and 
subsequent merge operations. The properties provided in the options of the dbt 
model are correctly fetched and applied during the first run. However, during 
the second run, when executing the merge operation, Hudi fetches a subset of 
the properties from the Hudi catalog table, leading to the addition of default 
properties and changes in configuration.
   
   
   ### Steps to Reproduce
   
   - Execute the dbt model with Hudi options for the initial insert.
   
  Sample model
  
{{
  config(
    materialized='incremental',
    file_format='hudi',
    pre_hook="SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root="file:///Users/B0279627/Downloads/Hudi",
    unique_key="id",
    incremental_strategy="merge",
    options={
      'preCombineField': 'id2',
      'hoodie.index.type': "GLOBAL_SIMPLE",
      'hoodie.simple.index.update.partition.path': 'true',
      'hoodie.keep.min.commits': '145',
      'hoodie.keep.max.commits': '288',
      'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
      'hoodie.cleaner.hours.retained': '72',
      'hoodie.cleaner.fileversions.retained': '144',
      'hoodie.cleaner.commits.retained': '144',
      'hoodie.upsert.shuffle.parallelism': '200',
      'hoodie.insert.shuffle.parallelism': '200',
      'hoodie.bulkinsert.shuffle.parallelism': '200',
      'hoodie.delete.shuffle.parallelism': '200',
      'hoodie.parquet.compression.codec': 'zstd',
      'hoodie.datasource.hive_sync.support_timestamp': 'true',
      'hoodie.datasource.write.reconcile.schema': 'true',
      'hoodie.enable.data.skipping': 'true',
      'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
  )
}}
   - Observe that all specified properties are correctly applied during the 
first run.
   - For verification, check a sample property such as hoodie.index.type=GLOBAL_SIMPLE.
   - Execute the dbt model with Hudi options for a subsequent merge operation.
   - Observe changes in the Hudi table properties, with defaults being applied for certain configurations; e.g., hoodie.index.type changes to SIMPLE (the created target table ends up with hoodie.index.type=SIMPLE). A quick way to inspect the persisted table config is sketched below.
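   
   A minimal sketch for comparing the persisted table config between the two runs, assuming the local table path from the model above (the exact path is hypothetical). Note that write-time configs such as hoodie.index.type may live in the catalog table properties (inspectable via `SHOW TBLPROPERTIES`) rather than in `.hoodie/hoodie.properties`:
   
   ```java
   import java.io.FileInputStream;
   import java.io.IOException;
   import java.util.Properties;
   
   public class DumpHoodieTableConfig {
     public static void main(String[] args) throws IOException {
       // Hypothetical table path: location_root from the model plus the model name.
       String basePath = "/Users/B0279627/Downloads/Hudi/my_model";
       Properties props = new Properties();
       try (FileInputStream in = new FileInputStream(basePath + "/.hoodie/hoodie.properties")) {
         props.load(in);
       }
       // Run once after the initial insert and once after the merge, then diff the output.
       props.stringPropertyNames().stream().sorted()
           .forEach(key -> System.out.println(key + "=" + props.getProperty(key)));
     }
   }
   ```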
   
   ### Expected Behavior
   Hudi should consistently set all specified properties in every run, 
irrespective of whether it is the initial insert or a subsequent merge 
operation. The properties passed in the options of the dbt model should be 
retained and applied consistently across all operations.
   
   ### Environment Description
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.1
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.1.1
   
   * DBT version: 1.7.1
   
   * Storage (HDFS/S3/GCS..) : Checked with s3 , hdfs and local file system.
   
   * Running on Docker? (yes/no) : no
   
   
   ### **Additional context**
   
   In the second run, MergeIntoHoodieTableCommand.scala executes InsertIntoHoodieTableCommand.run(). In this case, Hudi fetches the props from the Hudi catalog table, where it picks up the tableConfigs and catalog properties. However, these are not the complete set of properties that I passed in the first run using dbt options. Because of this, Hudi adds some other default properties to the hoodie props that are not fetched from the Hudi catalog props, which seems to be why many properties are changing.
   Below I have attached some images of the properties fetched in subsequent merge operations:
   
   https://github.com/apache/hudi/assets/31952894/46126281-b95a-47a4-9116-66a093a97506
   https://github.com/apache/hudi/assets/31952894/80ba4206-77d0-4852-aaf1-fd0e19c91025
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927384615

   
   ## CI report:
   
   * 264059fcce703e1bde6c07bdce6ee106fcff30a6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22325)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927384527

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
 
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   * 36b0460dcb4c7ecc69d79d92befaab358a068d4e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10625:
URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927368985

   
   ## CI report:
   
   * 264059fcce703e1bde6c07bdce6ee106fcff30a6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927353626

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
 
   * e69065c1325a38735b053108f72341db0cd31da9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7384) Implement writer path support for secondary index

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7384:
-
Labels: pull-request-available  (was: )

> Implement writer path support for secondary index
> -
>
> Key: HUDI-7384
> URL: https://issues.apache.org/jira/browse/HUDI-7384
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
>
> # Basic initialization on an existing table
>  # Handle inserts/upserts



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]

2024-02-05 Thread via GitHub


bhat-vinay opened a new pull request, #10625:
URL: https://github.com/apache/hudi/pull/10625

   … defined through options
   
   Initial commit. Supports the following features:
   1. Modify schema to add secondary index to metadata
   2. New partition type  in the metadata table to store 
secondary_keys-to-record_keys mapping
   3. Various options to support secondary index enablement, column mappings 
(for secondary keys) etc
   4. Initialization of secondary keys
   5. Update secondary keys on inserts/upserts
   
   Supports only one secondary index at the moment. The PR is still a WIP and 
needs more work to handle deletions, proper merging, compaction, (re) 
clustering among other things.
   
   ### Change Logs
   
   Initial commit. Supports the following features:
   1. Modify schema to add secondary index to metadata
   2. New partition type in the metadata table to store 
secondary_keys-to-record_keys mapping
   3. Various options to support secondary index enablement, column mappings 
(for secondary keys), etc.
   4. Initialization of secondary keys
   5. Update secondary keys on inserts/upserts
   
   Supports only one secondary index at the moment. The PR is still a WIP and 
needs more work to handle deletions, proper merging, compaction, and (re) 
clustering, among other things.
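   
   To make the new mapping concrete - an illustrative sketch only, not code 
from this PR: the new metadata-table partition conceptually stores a 
non-unique secondary key against the set of record keys that carry it. Class 
and method names below are assumptions for illustration.
   
   ```java
   import java.util.Map;
   import java.util.Set;
   import java.util.TreeMap;
   import java.util.TreeSet;
   
   public class SecondaryIndexSketch {
     // Conceptual payload of the secondary-index partition: one secondary key
     // maps to the set of record keys carrying it (non-unique, unlike the
     // record index). Names and structure here are illustrative assumptions.
     private final Map<String, Set<String>> secondaryToRecordKeys = new TreeMap<>();
   
     void index(String secondaryKey, String recordKey) {
       secondaryToRecordKeys.computeIfAbsent(secondaryKey, k -> new TreeSet<>()).add(recordKey);
     }
   
     Set<String> lookup(String secondaryKey) {
       return secondaryToRecordKeys.getOrDefault(secondaryKey, new TreeSet<>());
     }
   }
   ```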
   
   ### Impact
   
   Support secondary index on columns (similar to record index, but for 
non-unique columns)
   
   ### Risk level (write none, low medium or high below)
   
   Medium. New and existing tests
   
   ### Documentation Update
   
   NA. Will be done later
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7384) Implement writer path support for secondary index

2024-02-05 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7384:
-

 Summary: Implement writer path support for secondary index
 Key: HUDI-7384
 URL: https://issues.apache.org/jira/browse/HUDI-7384
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Vinaykumar Bhat


 # Basic initialization on an existing table
 # Handle inserts/upserts





[jira] [Assigned] (HUDI-7384) Implement writer path support for secondary index

2024-02-05 Thread Vinaykumar Bhat (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7384:
-

Assignee: Vinaykumar Bhat

> Implement writer path support for secondary index
> -
>
> Key: HUDI-7384
> URL: https://issues.apache.org/jira/browse/HUDI-7384
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> # Basic initialization on an existing table
>  # Handle inserts/upserts





[jira] [Created] (HUDI-7383) CDC query failed due to dependency issue

2024-02-05 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-7383:


 Summary: CDC query failed due to dependency issue
 Key: HUDI-7383
 URL: https://issues.apache.org/jira/browse/HUDI-7383
 Project: Apache Hudi
  Issue Type: Bug
  Components: incremental-query
Affects Versions: 0.14.1, 0.14.0
Reporter: Raymond Xu


{code:java}
spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', 
'20240205084624923', '20240205091637412');
24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID 1515) 
(ip-10-0-117-21.us-west-2.compute.internal executor 3): 
java.lang.NoClassDefFoundError: 
org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$
    at 
org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.<init>(HoodieCDCRDD.scala:237)
    at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
    at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: 
org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 21 more {code}
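
For context (not from this ticket): the failing call site amounts to 
registering the Scala module on an ObjectMapper, roughly as in the sketch 
below; the relocated class name in the trace suggests the bundle shades the 
jackson packages but does not ship jackson-module-scala itself.
{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.scala.DefaultScalaModule$;

public class ScalaModuleInitSketch {
  public static void main(String[] args) {
    // HoodieCDCRDD's iterator performs this kind of initialization; if the
    // (relocated) module class is missing from the bundle, loading fails with
    // the NoClassDefFoundError above. Illustrative sketch only.
    ObjectMapper mapper = new ObjectMapper();
    mapper.registerModule(DefaultScalaModule$.MODULE$);
    System.out.println("Registered modules: " + mapper.getRegisteredModuleIds());
  }
}
{code}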





Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927227972

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
 
   * e69065c1325a38735b053108f72341db0cd31da9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


dipankarmazumdar opened a new pull request, #10624:
URL: https://github.com/apache/hudi/pull/10624

   ### Change Logs
   
   This PR adds a new page to the Hudi documentation called 'Apache Hudi Stack'
   
   ### Impact
   
   Adds a new page for clarity around Hudi's platform & architecture
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   Update is for documentation
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - CI passed
   





Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927134709

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10623:
URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927118936

   
   ## CI report:
   
   * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10621:
URL: https://github.com/apache/hudi/pull/10621#issuecomment-1927104255

   
   ## CI report:
   
   * 03e73542a48058577ff24fa42a6aebc1d4e2991e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22322)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Flink Table planner not loading problem [hudi]

2024-02-05 Thread via GitHub


vkhoroshko commented on issue #8265:
URL: https://github.com/apache/hudi/issues/8265#issuecomment-1927062770

   Hello,
   
   Is there any solution for this? I'm running the Flink SQL client locally, 
and it has flink-table-planner-loader-1.17.1.jar in the /opt/flink/lib folder 
(I'm using Docker).
   
   However, if Async Clustering is enabled, I receive the same error as above:
   
   ```java.lang.ClassNotFoundException: 
org.apache.flink.table.planner.codegen.sort.SortCodeGenerator```





[PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]

2024-02-05 Thread via GitHub


parisni opened a new pull request, #10623:
URL: https://github.com/apache/hudi/pull/10623

   ### Change Logs
   
   After a few days in production, it turned out Glue has a hard limit on 
expression length (2048 chars). This patch handles that case by falling back 
to returning all existing partitions.
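   
   Not from the patch itself - a minimal sketch of the fallback described 
above, assuming the AWS SDK v2 Glue client; class and constant names here are 
illustrative, not the PR's actual code:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   import software.amazon.awssdk.services.glue.GlueClient;
   import software.amazon.awssdk.services.glue.model.GetPartitionsRequest;
   import software.amazon.awssdk.services.glue.model.Partition;
   
   public class GluePartitionFallback {
     // AWS Glue rejects GetPartitions expressions longer than 2048 characters.
     private static final int GLUE_EXPRESSION_MAX_CHARS = 2048;
   
     static List<Partition> getPartitions(GlueClient glue, String database,
                                          String table, String expression) {
       GetPartitionsRequest.Builder request = GetPartitionsRequest.builder()
           .databaseName(database)
           .tableName(table);
       if (expression != null && expression.length() <= GLUE_EXPRESSION_MAX_CHARS) {
         request.expression(expression);
       }
       // An over-limit expression is simply dropped, so the request degrades
       // to an unfiltered listing that returns all existing partitions.
       List<Partition> partitions = new ArrayList<>();
       glue.getPartitionsPaginator(request.build())
           .forEach(response -> partitions.addAll(response.partitions()));
       return partitions;
     }
   }
   ```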
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]

2024-02-05 Thread via GitHub


Amar1404 commented on issue #10309:
URL: https://github.com/apache/hudi/issues/10309#issuecomment-1926993882

   hi @ad1happy2go - In my case the column type in the table was long and was 
changed to double.





[I] Apache Hudi Auto-Size During Writes is not Working for Flink SQL [hudi]

2024-02-05 Thread via GitHub


vkhoroshko opened a new issue, #10622:
URL: https://github.com/apache/hudi/issues/10622

   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Use Flink SQL with the file below.
   
   **Current behavior**
   A separate parquet file is produced with every Flink commit (during 
checkpointing)
   
   **Expected behavior**
   Data is appended to existing parquet file(s) until the max file size 
threshold is met.
   
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Flink version : 1.17.1
   
   * Storage (HDFS/S3/GCS..) : File System
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   The expectation (as depicted in the Apache Hudi docs - 
https://hudi.apache.org/docs/file_sizing#auto-sizing-during-writes) is that 
with every Flink commit (every minute) a set of records will be accumulated 
and written to one of the existing parquet files until the parquet max file 
size threshold is met (5MB in the example below).
   However, what happens is that every commit results in a separate parquet 
file (~400KB in size); these files accumulate and are never merged. Please 
help.
   
   SQL file:
   ```
   SET 'parallelism.default' = '1';
   SET 'execution.checkpointing.interval' = '1m';
   
   CREATE TABLE datagen
   (
   id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
   data STRING
   ) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '5'
   );
   
   CREATE TABLE hudi_tbl
   (
   id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
   data STRING
   ) WITH (
 'connector' = 'hudi',
 'path' = 'file:///opt/hudi',
 'table.type' = 'COPY_ON_WRITE',
 'write.parquet.block.size' = '1',
 'write.operation' = 'insert',
 'write.parquet.max.file.size' = '5'
   );
   
   INSERT INTO hudi_tbl SELECT * from datagen;
   ```
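   
   Not part of the original report - one thing worth checking, under the 
assumption that Hudi's Flink size options are interpreted in MB and that 
small-file merging for COPY_ON_WRITE insert mode is opt-in via 
`write.insert.cluster` (option names not verified against 0.14.1). A minimal 
Table API sketch:
   
   ```java
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;
   
   public class HudiInsertClusterSketch {
     public static void main(String[] args) {
       TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
       // Same table as above, but asking the insert path to merge small files.
       tEnv.executeSql(
           "CREATE TABLE hudi_tbl ("
               + "  id INT NOT NULL PRIMARY KEY NOT ENFORCED,"
               + "  data STRING"
               + ") WITH ("
               + "  'connector' = 'hudi',"
               + "  'path' = 'file:///opt/hudi',"
               + "  'table.type' = 'COPY_ON_WRITE',"
               + "  'write.operation' = 'insert',"
               + "  'write.insert.cluster' = 'true',"    // assumed: merge small files on insert
               + "  'write.parquet.max.file.size' = '5'" // assumed: value in MB
               + ")");
     }
   }
   ```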
   
   
   





Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]

2024-02-05 Thread via GitHub


VitoMakarevich commented on issue #10590:
URL: https://github.com/apache/hudi/issues/10590#issuecomment-1926893180

   The same happens with the streaming source, since `HoodieSourceOffset` has 
   `import com.fasterxml.jackson.module.scala.DefaultScalaModule`.
   As for the 0.14.1 bundle - the only jackson module it includes is
   `com.fasterxml.jackson.module.afterburner`.





Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10621:
URL: https://github.com/apache/hudi/pull/10621#issuecomment-1926762566

   
   ## CI report:
   
   * 03e73542a48058577ff24fa42a6aebc1d4e2991e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22322)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [WIP] [HUDI-5823] [RFC-65] Update to the Partition TTL RFC [hudi]

2024-02-05 Thread via GitHub


geserdugarov closed pull request #10248: [WIP] [HUDI-5823] [RFC-65] Update to 
the Partition TTL RFC
URL: https://github.com/apache/hudi/pull/10248





Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10621:
URL: https://github.com/apache/hudi/pull/10621#issuecomment-1926750611

   
   ## CI report:
   
   * 03e73542a48058577ff24fa42a6aebc1d4e2991e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10618:
URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926738703

   
   ## CI report:
   
   * 5576069d77bdb9202c83627c2f0b93a9ae7ed208 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10619:
URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926738761

   
   ## CI report:
   
   * 9945ee19750336801b3b816710234deabfce3b63 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22321)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]

2024-02-05 Thread via GitHub


PrabhuJoseph commented on PR #10618:
URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926722425

   @danny0405 Could you review this patch when you get time. Thanks.





[PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


fhan688 opened a new pull request, #10621:
URL: https://github.com/apache/hudi/pull/10621

   ### Change Logs
   
   Get partitions from active timeline instead of listing when building 
clustering plan
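   
   For illustration only (not this PR's actual code) - a rough sketch of the 
idea, deriving touched partitions from completed instants on the active 
timeline instead of listing the file system; the timeline API usage below is 
an assumption based on common Hudi patterns:
   
   ```java
   import java.io.IOException;
   import java.util.Set;
   import java.util.TreeSet;
   
   import org.apache.hudi.common.model.HoodieCommitMetadata;
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.HoodieInstant;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;
   
   public class TimelinePartitionsSketch {
     // Collect candidate partitions from commit metadata: only partitions
     // touched by completed writes are returned, with no directory listing.
     static Set<String> partitionsFromActiveTimeline(HoodieTableMetaClient metaClient) throws IOException {
       Set<String> partitions = new TreeSet<>();
       HoodieTimeline completed = metaClient.getActiveTimeline()
           .getCommitsTimeline()
           .filterCompletedInstants();
       for (HoodieInstant instant : completed.getInstants()) {
         HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
             completed.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
         partitions.addAll(metadata.getPartitionToWriteStats().keySet());
       }
       return partitions;
     }
   }
   ```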
   
   ### Impact
   
   New strategy to build clustering plan for Flink
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


fhan688 closed pull request #10620: [HUDI-7382] Get partitions from active 
timeline instead of listing when building clustering plan
URL: https://github.com/apache/hudi/pull/10620





[jira] [Updated] (HUDI-7382) Get partitions from active timeline instead of listing when building clustering plan

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7382:
-
Labels: pull-request-available  (was: )

> Get partitions from active timeline instead of listing when building 
> clustering plan
> 
>
> Key: HUDI-7382
> URL: https://issues.apache.org/jira/browse/HUDI-7382
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: fhan
>Priority: Major
>  Labels: pull-request-available
>






[PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]

2024-02-05 Thread via GitHub


fhan688 opened a new pull request, #10620:
URL: https://github.com/apache/hudi/pull/10620

   ### Change Logs
   
   Get partitions from active timeline instead of listing when building 
clustering plan
   
   ### Impact
   
   New strategy to build clustering plan for Flink
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7382) Get partitions from active timeline instead of listing when building clustering plan

2024-02-05 Thread fhan (Jira)
fhan created HUDI-7382:
--

 Summary: Get partitions from active timeline instead of listing 
when building clustering plan
 Key: HUDI-7382
 URL: https://issues.apache.org/jira/browse/HUDI-7382
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: fhan








Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-02-05 Thread via GitHub


maheshguptags commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1926535894

   @ad1happy2go I tried without RLI, and it works fine. However, when I add 
the `RLI` index to the table, it starts failing. 
   I am not sure why RLI causes errors while everything works fine without 
any index. 





Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10617:
URL: https://github.com/apache/hudi/pull/10617#issuecomment-1926524438

   
   ## CI report:
   
   * 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1926523966

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
   * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
   * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
   * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
   * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
   * 24839296069f8b228f31e7000c77a4630913dc07 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10619:
URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926450038

   
   ## CI report:
   
   * 9945ee19750336801b3b816710234deabfce3b63 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22321)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10618:
URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926449973

   
   ## CI report:
   
   * 5576069d77bdb9202c83627c2f0b93a9ae7ed208 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22320)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10619:
URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926439711

   
   ## CI report:
   
   * 9945ee19750336801b3b816710234deabfce3b63 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]

2024-02-05 Thread via GitHub


hudi-bot commented on PR #10618:
URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926439656

   
   ## CI report:
   
   * 5576069d77bdb9202c83627c2f0b93a9ae7ed208 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   

