Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]
hudi-bot commented on PR #10617: URL: https://github.com/apache/hudi/pull/10617#issuecomment-1928930674 ## CI report: * 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319) * 4b6f9845a2799ed74a6f0fef60519fc7c93371ea Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22337) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]
hudi-bot commented on PR #10617: URL: https://github.com/apache/hudi/pull/10617#issuecomment-1928922485

## CI report:

* 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319)
* 4b6f9845a2799ed74a6f0fef60519fc7c93371ea UNKNOWN
Re: [I] [BUG] S3 Deltastreamer: Block has already been inflated [hudi]
chestnutqiang commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1928891833

Same problem here: version 0.14.1, on HDFS.
Re: [PR] [HUDI-6902] Containerize the Azure CI 4th module [hudi]
linliu-code commented on code in PR #10512: URL: https://github.com/apache/hudi/pull/10512#discussion_r1479262480

## hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml:
@@ -92,6 +92,34 @@
+
+thrift-gen-source-with-script

Review Comment: In this change, we add a new profile that uses maven-thrift-plugin to compile Thrift in the Azure CI container. In GH CI, we keep the original way, since installing Thrift would take about 10+ minutes. Therefore, we have two ways to compile Thrift: in Azure CI, we use maven-thrift-plugin; in GH CI, we use the original script. To separate them, I created two profiles.
Re: [PR] [HUDI-7146] Support non-unique keys for secondary index [hudi]
codope commented on PR #10211: URL: https://github.com/apache/hudi/pull/10211#issuecomment-1928815701

Closing in favor of #10617
Re: [PR] [HUDI-7146] Support non-unique keys for secondary index [hudi]
codope closed pull request #10211: [HUDI-7146] Support non-unique keys for secondary index URL: https://github.com/apache/hudi/pull/10211
Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]
SudhirSaxena commented on issue #10587: URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928800086

Thanks @ad1happy2go, I will follow these steps and let you know.
Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]
ad1happy2go commented on issue #10587: URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928794959

Had a conversation with @SudhirSaxena on this and looked at his setup. He is using emr-6.15 with OSS Hudi 0.14.0.

1. With RLI enabled, the upsert job gets stuck for hours with no progress. Also no useful logs, and no running stage in the UI. Driver logs - above comment.
2. We tried with RLI disabled, keeping everything else the same, but saw similar behaviour. So RLI may not be the issue.

Next steps -
- Create a test script that does a bulk insert and insert from the quickstart, and see if it works.
- Try the same setup with the 0.14.1 version.
- Try downgrading to an EMR version that supports Spark 3.3, and run Hudi 0.14.0 to see if we hit the same behaviour.
- Then downgrade the Hudi version to 0.12.3, which was used before, and confirm that works fine.
Re: [I] Upsert operation not working and job is running longer while using "Record level index" in Apache Hudi 0.14 in EMR 6.15 [hudi]
SudhirSaxena commented on issue #10587: URL: https://github.com/apache/hudi/issues/10587#issuecomment-1928793715

@ad1happy2go as discussed for this issue, please find the driver logs:

24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 KiB bytes)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_12999_piece0 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_12999_piece0 to disk
24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_12999_piece0 on disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original size: 0.0 B)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13000 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13000 to disk
24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 1342.0 KiB
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48457 stored as values in memory (estimated size 420.5 KiB, free 921.5 KiB)
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48457_piece0 stored as bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48457_piece0 in memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
24/02/06 05:04:07 INFO SparkContext: Created broadcast 48457 from take at /mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 KiB bytes)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13000_piece0 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13000_piece0 to disk
24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13000_piece0 on disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original size: 0.0 B)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13001 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13001 to disk
24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 1342.0 KiB
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48458 stored as values in memory (estimated size 420.5 KiB, free 921.5 KiB)
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48458_piece0 stored as bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48458_piece0 in memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
24/02/06 05:04:07 INFO SparkContext: Created broadcast 48458 from take at /mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 KiB bytes)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13001_piece0 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13001_piece0 to disk
24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13001_piece0 on disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original size: 0.0 B)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13002 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13002 to disk
24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 1342.0 KiB
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48459 stored as values in memory (estimated size 420.5 KiB, free 921.5 KiB)
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48459_piece0 stored as bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
24/02/06 05:04:07 INFO BlockManagerInfo: Added broadcast_48459_piece0 in memory on ip-10-156-17-116.ec2.internal:35559 (size: 47.4 KiB, free: 14.2 GiB)
24/02/06 05:04:07 INFO SparkContext: Created broadcast 48459 from take at /mnt/tmp/spark-7f0d7e5b-8199-405d-aadf-6da6fd1c5cd0/RES_PNR_prod_HRLI.py:1191
24/02/06 05:04:07 INFO MemoryStore: 2 blocks selected for dropping (468.0 KiB bytes)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13002_piece0 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13002_piece0 to disk
24/02/06 05:04:07 INFO BlockManagerInfo: Updated broadcast_13002_piece0 on disk on ip-10-156-17-116.ec2.internal:35559 (current size: 47.4 KiB, original size: 0.0 B)
24/02/06 05:04:07 INFO BlockManager: Dropping block broadcast_13003 from memory
24/02/06 05:04:07 INFO BlockManager: Writing block broadcast_13003 to disk
24/02/06 05:04:07 INFO MemoryStore: After dropping 2 blocks, free memory is 1342.0 KiB
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48460 stored as values in memory (estimated size 420.5 KiB, free 921.5 KiB)
24/02/06 05:04:07 INFO MemoryStore: Block broadcast_48460_piece0 stored as bytes in memory (estimated size 47.4 KiB, free 874.0 KiB)
24/02/06 05:04:07 INFO
Re: [PR] [HUDI-6902] Containerize the Azure CI 4th module [hudi]
codope commented on code in PR #10512: URL: https://github.com/apache/hudi/pull/10512#discussion_r1479200835

## hudi-platform-service/hudi-metaserver/hudi-metaserver-server/pom.xml:
@@ -92,6 +92,34 @@
+
+thrift-gen-source-with-script

Review Comment: Just for my understanding, why do we need a separate profile? Shouldn't the hudi-platform-service profile suffice?
Re: [PR] [HUDI-9424] Support using local timezone when writing flink TIMESTAMP data [hudi]
danny0405 commented on PR #10594: URL: https://github.com/apache/hudi/pull/10594#issuecomment-1928733456

The Travis tests still got failures.
[jira] [Closed] (HUDI-7338) Bump HBase, pulsar-client, and jetty version
[ https://issues.apache.org/jira/browse/HUDI-7338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7338.
Fix Version/s: 1.0.0
    Resolution: Fixed

Fixed via master branch: c1d47014ca0430b2e2f4c2225767f2754a4fab2c

> Bump HBase, pulsar-client, and jetty version
> --------------------------------------------
>
>                 Key: HUDI-7338
>                 URL: https://issues.apache.org/jira/browse/HUDI-7338
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Shawn Chang
>            Assignee: Shawn Chang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> There is a major CVE spotted in jetty/netty:
> [https://nvd.nist.gov/vuln/detail/CVE-2023-44487]
>
> Bumping the version can help mitigate the problem

-- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new c1d47014ca0 [HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)

commit c1d47014ca0430b2e2f4c2225767f2754a4fab2c
Author: Shawn Chang <42792772+c...@users.noreply.github.com>
AuthorDate: Mon Feb 5 19:43:50 2024 -0800

    [HUDI-7338] Bump HBase, Pulsar, Jetty version (#10223)

    Co-authored-by: Shawn Chang
---
 pom.xml | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/pom.xml b/pom.xml
index d9a87558939..3eeed340178 100644
--- a/pom.xml
+++ b/pom.xml
@@ -102,7 +102,7 @@
 ${fasterxml.spark3.version}
 2.0.0
 2.8.0
-2.10.2
+3.0.2
 ${pulsar.spark.scala12.version}
 2.4.5
 3.1.1.4
@@ -189,9 +189,9 @@
 log4j2-surefire.properties
 0.13.0
 4.6.7
-9.4.48.v20220622
+9.4.53.v20231009
 3.1.0-incubating
-2.4.9
+2.4.13
 1.4.199
 3.1.2
 false
@@ -476,6 +476,7 @@
 org.apache.hbase.thirdparty:hbase-shaded-miscellaneous
 org.apache.hbase.thirdparty:hbase-shaded-netty
 org.apache.hbase.thirdparty:hbase-shaded-protobuf
+org.apache.hbase.thirdparty:hbase-unsafe
 org.apache.htrace:htrace-core4
 com.fasterxml.jackson.module:jackson-module-afterburner
Re: [PR] [HUDI-7338] Upgrade Jetty, HBase, and pulsar-client [hudi]
danny0405 merged PR #10223: URL: https://github.com/apache/hudi/pull/10223
Re: [I] [SUPPORT] Using MOR table and synchronizing to Hive, Flink checkpoint failed, resulting in log files being unable to roll over to parquet files [hudi]
danny0405 commented on issue #10616: URL: https://github.com/apache/hudi/issues/10616#issuecomment-1928729633

Are you using append mode or upsert mode?
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928656987

## CI report:

* 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
* 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
* a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
* a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
* b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
* d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
* e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
* f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
* 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
* 06c2064ab7a3087ae57f345253dd8ed0a9615c02 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22334)
(hudi) branch master updated: [HUDI-7366] Fix HoodieLocation with encoded paths (#10602)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 590506752c1 [HUDI-7366] Fix HoodieLocation with encoded paths (#10602)

commit 590506752c1034183906526c4c414e7500953f1b
Author: Y Ethan Guo
AuthorDate: Mon Feb 5 17:31:35 2024 -0800

    [HUDI-7366] Fix HoodieLocation with encoded paths (#10602)
---
 .../main/java/org/apache/hudi/storage/HoodieLocation.java    | 3 ++-
 .../java/org/apache/hudi/io/storage/TestHoodieLocation.java  | 12 ++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java b/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
index 3b3a05dc9b4..2073548b7d1 100644
--- a/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
+++ b/hudi-io/src/main/java/org/apache/hudi/storage/HoodieLocation.java
@@ -108,7 +108,8 @@ public class HoodieLocation implements Comparable, Serializable
         parentUri.getAuthority(),
         parentPathWithSeparator,
         null,
-        parentUri.getFragment()).resolve(normalizedChild);
+        parentUri.getFragment())
+        .resolve(new URI(null, null, normalizedChild, null, null));
     this.uri = new URI(
         parentUri.getScheme(),
         parentUri.getAuthority(),
diff --git a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
index 4c765d2cc3f..7c3af8741ba 100644
--- a/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
+++ b/hudi-io/src/test/java/org/apache/hudi/io/storage/TestHoodieLocation.java
@@ -115,6 +115,18 @@ public class TestHoodieLocation {
         new HoodieLocation(new HoodieLocation(new URI("foo://bar/baz#bud")), "/fud#boo").toString());
   }

+  @Test
+  public void testEncoded() {
+    // encoded character like `%2F` should be kept as is
+    assertEquals(new HoodieLocation("s3://foo/bar/1%2F2%2F3"), new HoodieLocation("s3://foo/bar", "1%2F2%2F3"));
+    assertEquals("s3://foo/bar/1%2F2%2F3", new HoodieLocation("s3://foo/bar", "1%2F2%2F3").toString());
+    assertEquals(new HoodieLocation("s3://foo/bar/1%2F2%2F3"),
+        new HoodieLocation(new HoodieLocation("s3://foo/bar"), "1%2F2%2F3"));
+    assertEquals("s3://foo/bar/1%2F2%2F3",
+        new HoodieLocation(new HoodieLocation("s3://foo/bar"), "1%2F2%2F3").toString());
+    assertEquals("s3://foo/bar/1%2F2%2F3", new HoodieLocation("s3://foo/bar/1%2F2%2F3").toString());
+  }
+
   @Test
   public void testPathToUriConversion() throws URISyntaxException {
     assertEquals(new URI(null, null, "/foo?bar", null, null),
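The commit above hinges on `java.net.URI` keeping percent-escapes intact when the child segment is handed to `resolve` as an already-parsed `URI` rather than a raw string. A minimal standalone sketch of that behavior (the class and `join` helper are hypothetical, not Hudi code):

```java
import java.net.URI;

public class EncodedPathDemo {
    // Join a parent location and a child segment by parsing the child into a
    // URI first, so existing percent-escapes such as %2F are carried through
    // as-is instead of being re-encoded during concatenation.
    static String join(String parent, String child) {
        // A trailing slash on the parent makes resolve() append the child.
        String base = parent.endsWith("/") ? parent : parent + "/";
        return URI.create(base).resolve(URI.create(child)).toString();
    }

    public static void main(String[] args) {
        // The encoded slashes in "1%2F2%2F3" survive the join untouched.
        System.out.println(join("s3://foo/bar", "1%2F2%2F3"));
    }
}
```

This mirrors the expectation in the `testEncoded` case above: an encoded character like `%2F` must not be double-encoded or decoded when building a child location.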
Re: [PR] [HUDI-7366] Fix HoodieLocation with encoded paths [hudi]
vinothchandar merged PR #10602: URL: https://github.com/apache/hudi/pull/10602
Re: [PR] [HUDI-7357] Introduce generic StorageConfiguration [hudi]
vinothchandar commented on code in PR #10586: URL: https://github.com/apache/hudi/pull/10586#discussion_r1479114527

## hudi-io/src/main/java/org/apache/hudi/storage/StorageConfiguration.java:
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+/**
+ * Interface providing the storage configuration in type {@link T}.
+ *
+ * @param <T> type of storage configuration to provide.
+ */
+public abstract class StorageConfiguration<T> implements Serializable {
+  /**
+   * @return the storage configuration.
+   */
+  public abstract T get();
+
+  /**
+   * @return a new copy of the storage configuration.
+   */
+  public abstract T newCopy();
+
+  /**
+   * Serializes the storage configuration.
+   * DO NOT change the signature, as required by {@link Serializable}.
+   *
+   * @param out stream to write.
+   * @throws IOException on I/O error.
+   */
+  public abstract void writeObject(ObjectOutputStream out) throws IOException;

Review Comment: Does this need to be `ObjectOutputStream` because of `Serializable`? We should have the ability to control the binary serialization of this object; let's make sure of that.

## hudi-hadoop-common/src/test/java/org/apache/hudi/storage/hadoop/TestHadoopStorageConfiguration.java:
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage.hadoop;
+
+import org.apache.hudi.io.storage.TestStorageConfigurationBase;
+import org.apache.hudi.storage.StorageConfiguration;
+
+import org.apache.hadoop.conf.Configuration;
+
+import java.util.Map;
+
+/**
+ * Tests {@link HadoopStorageConfiguration}.
+ */
+public class TestHadoopStorageConfiguration extends TestStorageConfigurationBase {

Review Comment: Again, this is a test class, not a test by itself. Fix naming?

## hudi-io/src/main/java/org/apache/hudi/storage/StorageConfiguration.java:
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.storage;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.io.Serializable;
+
+/**
+ * Interface providing the storage configuration in type {@link T}.
+ *
+ * @param <T> type of storage configuration to provide.
+ */
+public abstract
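For context on the `writeObject` discussion above: Java serialization only invokes these hooks when they are declared as private instance methods on a `Serializable` class, which is what lets a class fully control its own binary form. A minimal sketch of that pattern, using a hypothetical `Properties`-backed configuration wrapper (not the actual Hudi class):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Hypothetical wrapper controlling its binary serialization via the
// private writeObject/readObject hooks defined by java.io.Serializable.
public class PropsStorageConf implements Serializable {
    private transient Properties props; // not serialized by the default mechanism

    public PropsStorageConf(Properties props) {
        this.props = props;
    }

    public Properties get() {
        return props;
    }

    // Custom binary form: write each key/value pair explicitly.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(props.size());
        for (String name : props.stringPropertyNames()) {
            out.writeUTF(name);
            out.writeUTF(props.getProperty(name));
        }
    }

    // Symmetric read: rebuild the transient Properties field.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        props = new Properties();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            props.setProperty(in.readUTF(), in.readUTF());
        }
    }

    // Round-trip helper for demonstration: serialize then deserialize.
    public static PropsStorageConf roundTrip(PropsStorageConf conf)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(conf);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (PropsStorageConf) ois.readObject();
        }
    }
}
```

One design note this illustrates: because the serialization hooks must be private to be picked up, declaring a public abstract `writeObject` in a base class, as in the diff under review, would not by itself be invoked by `ObjectOutputStream.writeObject`.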
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
bhasudha commented on PR #10624: URL: https://github.com/apache/hudi/pull/10624#issuecomment-1928592401

Tested it locally; the diagrams may need to be reduced in size, since they feel a little disproportionate compared to other pages.
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928546567

## CI report:

* 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
* 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
* a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
* a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
* b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
* d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
* e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
* f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
* 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
* a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
* 06c2064ab7a3087ae57f345253dd8ed0a9615c02 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22334)
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928492419

## CI report:

* 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
* 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
* a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
* a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
* b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
* d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
* e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
* f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
* 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
* a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
* 06c2064ab7a3087ae57f345253dd8ed0a9615c02 UNKNOWN
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928476735

## CI report:

* e39968e5155283e2c25a31626732a1cdde634840 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)
(hudi) branch master updated (ff0e67f78df -> c098ebaf166)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

 from ff0e67f78df [HUDI-7351] Implement partition pushdown for glue (#10604)
  add c098ebaf166 [HUDI-7375] Disable a flaky test method (#10627)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/common/functional/TestHoodieLogFormat.java | 2 ++
 1 file changed, 2 insertions(+)
Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]
yihua merged PR #10627: URL: https://github.com/apache/hudi/pull/10627
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1928111329

## CI report:

* 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
* 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
* a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
* a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
* b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
* d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
* e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
* f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
* 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN
* a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330)
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928100117

## CI report:

* 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
* e39968e5155283e2c25a31626732a1cdde634840 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)
Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]
hudi-bot commented on PR #10627: URL: https://github.com/apache/hudi/pull/10627#issuecomment-1928100175

## CI report:

* f86247ccd72b443975c8ab08b74300627641c5c8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22331)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[I] File not found while using metadata table for insert_overwrite table [hudi]
Shubham21k opened a new issue, #10628: URL: https://github.com/apache/hudi/issues/10628

We are incrementally writing to a Hudi table with insert_overwrite operations. Recently, we enabled the Hudi metadata table for these tables. However, after a few days we started to encounter `FileNotFoundException` when reading these tables from Athena (with metadata listing enabled). On further investigation, we observed that the metadata contains older files that were cleaned up by the cleaner and are no longer available.

Steps to reproduce the behavior:

1. Create a simple df and write it to a Hudi table incrementally with these properties:
```
hoodie.datasource.meta.sync.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
hoodie.write.markers.type=DIRECT
**hoodie.metadata.enable=true
hoodie.datasource.write.operation=insert_overwrite**
hoodie.datasource.write.partitionpath.field=cs_load_hr
hoodie.datasource.hive_sync.partition_fields=cs_load_hr
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd/HH
hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
hoodie.parquet.compression.codec=snappy
hoodie.table.services.enabled=true
hoodie.rollback.using.markers=false
hoodie.commits.archival.batch=30
hoodie.archive.delete.parallelism=500
hoodie.index.type=SIMPLE
hoodie.clean.allow.multiple=false
hoodie.clean.async=true
hoodie.clean.automatic=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=3
hoodie.cleaner.parallelism=500
hoodie.cleaner.incremental.mode=true
hoodie.clean.max.commits=8
hoodie.archive.async=true
hoodie.archive.automatic=true
hoodie.archive.merge.enable=true
hoodie.archive.merge.files.batch.size=60
hoodie.keep.max.commits=10
hoodie.keep.min.commits=5
```
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)
2. After a few incremental writes, some of the base files get replaced, but the metadata does not get updated properly; it continues to keep pointers to the old files as well.
3. If you try reading the table using Spark or Athena, you will get `FileNotFoundException` (keep in mind to enable metadata listing while reading). Upon disabling metadata listing on the read side, there is no error and reads work fine.
4. Note: we have observed this issue only for **insert_overwrite** operations. An upsert table's metadata gets updated correctly.

**Expected behavior**

The Hudi metadata table should get updated correctly.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.2.1
* Hive version : NA
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :

**Additional context**

The timeline of the corrupted tables also contains replaceCommits (which are not present in the case of upsert tables):
```
$ aws s3 ls s3://tmp-data/investments_ctr_tbl/.hoodie/
                           PRE .aux/
                           PRE archived/
                           PRE metadata/
2023-12-08 13:32:17          0 .aux_$folder$
2023-12-08 13:32:17          0 .schema_$folder$
2023-12-08 13:32:17          0 .temp_$folder$
2023-12-14 22:17:18       4678 20231214221641350.clean
2023-12-14 22:17:11       3227 20231214221641350.clean.inflight
2023-12-14 22:17:10       3227 20231214221641350.clean.requested
2023-12-22 21:50:54       4439 2023114849300.clean
2023-12-22 21:50:45       4337 2023114849300.clean.inflight
2023-12-22 21:50:45       4337 2023114849300.clean.requested
2023-12-30 21:51:16       4439 20231230214431936.clean
2023-12-30 21:51:07       4337 20231230214431936.clean.inflight
2023-12-30 21:51:07       4337 20231230214431936.clean.requested
2024-01-07 21:53:30       4439 20240107215204594.clean
2024-01-07 21:53:23       4337 20240107215204594.clean.inflight
2024-01-07 21:53:22       4337 20240107215204594.clean.requested
2024-01-15 21:55:00       4439 20240115215112126.clean
2024-01-15 21:54:52       4337 20240115215112126.clean.inflight
2024-01-15 21:54:52       4337 20240115215112126.clean.requested
2024-01-23 21:46:53       4439 20240123214442067.clean
2024-01-23 21:46:45       4337 20240123214442067.clean.inflight
2024-01-23 21:46:45       4337
```
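The reproduction above boils down to repeated insert_overwrite writes with the metadata table enabled. A condensed PySpark sketch of that write path is below; `hudi_options`, `write_batch`, `df`, and `path` are hypothetical names introduced here for illustration, and only a subset of the reported options is shown (values taken from the report):

```python
# Subset of the write options from the issue report; the two options marked
# in bold in the report are the ones implicated in the problem.
hudi_options = {
    "hoodie.metadata.enable": "true",                          # metadata table on
    "hoodie.datasource.write.operation": "insert_overwrite",   # overwrite touched partitions
    "hoodie.datasource.write.partitionpath.field": "cs_load_hr",
    "hoodie.index.type": "SIMPLE",
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "3",
}

def write_batch(df, path):
    # Each call overwrites the partitions present in df. Per the report, after
    # a few such writes the cleaner removes old base files while the metadata
    # table still lists them, so metadata-based reads hit FileNotFoundException.
    (df.write.format("org.apache.hudi")
       .options(**hudi_options)
       .mode("append")
       .save(path))
```

This is only a sketch of the reporter's setup, not their exact job; the full option list is in the issue text above.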
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478878886

## website/docs/hudi_stack.md:
```
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
```

Review Comment:
@dipankarmazumdar can you also fix all occurrences for File Group, File Slice, Base File, Log File, etc to align on the casing, indicating these are hudi specific terms
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478877408

## website/docs/hudi_stack.md:
```
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file layout of the table, the schema, and metadata tracking changes. Hudi organizes files within a table or partition into File Groups. Updates are captured in log files tied to these File Groups, ensuring efficient merges. There are three major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs, even after archival of related actions. This approach effectively limits metadata size based on table activity frequency, essential for managing tables with frequent updates.
+
+- **File Group and File Slice** : Within each partition the data is physically stored as base and log files and organized into logical concepts as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. File groups contain multiple versions of file slices and are split into multiple file slices. A file slice comprises the base and log file. Each file slice within the file-group is uniquely identified by the commit's timestamp that created it.
+
+- **Metadata Table**
```
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478876581

## website/docs/hudi_stack.md:
```
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file layout of the table, the schema, and metadata tracking changes. Hudi organizes files within a table or partition into File Groups. Updates are captured in log files tied to these File Groups, ensuring efficient merges. There are three major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs, even after archival of related actions. This approach effectively limits metadata size based on table activity frequency, essential for managing tables with frequent updates.
+
+- **File Group and File Slice** : Within each partition the data is physically stored as base and log files and organized into logical concepts as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. File groups contain multiple versions of file slices and are split into multiple file slices. A file slice comprises the base and log file. Each file slice within the file-group is uniquely identified by the commit's timestamp that created it.
+
+- **Metadata Table**
```
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478873515

## website/docs/hudi_stack.md:
```
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains about the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project.
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storages.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file layout of the table, the schema, and metadata tracking changes. Hudi organizes files within a table or partition into File Groups. Updates are captured in log files tied to these File Groups, ensuring efficient merges. There are three major components related to Hudi’s table format.
+
+- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs, even after archival of related actions. This approach effectively limits metadata size based on table activity frequency, essential for managing tables with frequent updates.
+
+- **File Group and File Slice** : Within each partition the data is physically stored as base and log files and organized into logical concepts as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. File groups contain multiple versions of file slices and are split into multiple file slices. A file slice comprises the base and log file. Each file slice within the file-group is uniquely identified by the commit's timestamp that created it.
+
+- **Metadata Table**
```
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928036693

## CI report:

* 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326)
* 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329)
* e39968e5155283e2c25a31626732a1cdde634840 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22332)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478869049 ## website/docs/hudi_stack.md: ## @@ -0,0 +1,99 @@ +--- +title: Apache Hudi Stack +summary: "Explains about the various layers of software components that make up Hudi" +toc: true +toc_min_heading_level: 2 +toc_max_heading_level: 4 +last_modified_at: +--- + +Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform. + +In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project. + +![Hudi Stack](/assets/images/blog/hudistack/hstck.png) +_Figure: Apache Hudi Architectural stack_ + +# Lake Storage +The storage layer is where the data files (such as Parquet) are stored. 
Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations. + +# File Formats +![File Format](/assets/images/blog/hudistack/file_format.png) +_Figure: File format structure in Hudi_ + +File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads. + +# Transactional Database Layer +The transactional database layer of Hudi comprises the core components that are responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on data lakehouse storages. + +## Table Format +![Table Format](/assets/images/blog/hudistack/table_format_1.png) +_Figure: Apache Hudi's Table format_ + +Drawing an analogy to file formats, a table format simply comprises the file layout of the table, the schema, and metadata tracking changes. Hudi organizes files within a table or partition into File Groups. 
Updates are captured in log files tied to these File Groups, ensuring efficient merges. There are three major components related to Hudi’s table format. + +- **Timeline** : Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs, even after archival of related actions. This approach effectively limits metadata size based on table activity frequency, which is essential for managing tables with frequent updates. + +- **File Group and File Slice** : Within each partition, the data is physically stored as base and log files, organized into the logical concepts of [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group contains multiple file slices, where each file slice comprises a base file and its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it. + +- **Metadata Table**
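The "each file group as a self-contained log" design described in the Timeline bullet can be illustrated with a small conceptual model. This is plain Python, not Hudi's actual on-disk encoding; the block shapes and the rollback handling below are simplified assumptions for illustration only:

```python
# Conceptual model of a Hudi file group as a self-contained log.
# Block types mirror the ones named above (data, delete, rollback);
# the dict-based encoding is illustrative, not Hudi's binary format.

def replay(blocks):
    """Reconstruct the latest record state by replaying log blocks in order."""
    state = {}
    applied = []  # (block_index, keys_written) pairs, kept for rollback
    for i, block in enumerate(blocks):
        kind = block["type"]
        if kind == "data":
            state.update(block["records"])  # upserts keyed by record key
            applied.append((i, list(block["records"])))
        elif kind == "delete":
            for key in block["keys"]:
                state.pop(key, None)
        elif kind == "rollback":
            # Invalidate records written by a previous (e.g. failed) block.
            for idx, keys in applied:
                if idx == block["target_block"]:
                    for key in keys:
                        state.pop(key, None)
    return state

blocks = [
    {"type": "data", "records": {"r1": "v1", "r2": "v1"}},
    {"type": "data", "records": {"r3": "v1"}},
    {"type": "delete", "keys": ["r1"]},
    {"type": "rollback", "target_block": 1},
]
print(replay(blocks))  # only r2 survives: r1 deleted, r3 rolled back
```

Replaying the blocks in order is what lets a reader recover record state from the file group alone, even after the timeline actions that produced those blocks have been archived.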
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1928026486 ## CI report: * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329) * e39968e5155283e2c25a31626732a1cdde634840 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
xushiyan commented on code in PR #10624: URL: https://github.com/apache/hudi/pull/10624#discussion_r1478861210 ## website/docs/hudi_stack.md: ## @@ -0,0 +1,99 @@ +--- +title: Apache Hudi Stack +summary: "Explains about the various layers of software components that make up Hudi" +toc: true +toc_min_heading_level: 2 +toc_max_heading_level: 4 +last_modified_at: +--- + +Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration along with various platform-specific services extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform. + +In this section, we will explore the Hudi stack and deconstruct the layers of software components that constitute Hudi. The features marked with an asterisk (*) represent work in progress, and the dotted boxes indicate planned future work. These components collectively aim to fulfill the [vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for the project. + +![Hudi Stack](/assets/images/blog/hudistack/hstck.png) +_Figure: Apache Hudi Architectural stack_ + +# Lake Storage +The storage layer is where the data files (such as Parquet) are stored. 
Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with various systems including HDFS for fast appends, and various cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementation to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays out the foundation for improved storage optimizations. + +# File Formats +![File Format](/assets/images/blog/hudistack/file_format.png) +_Figure: File format structure in Hudi_ + +File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure. The base files are compacted and optimized for reads and are augmented with log files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads. Review Comment: ```suggestion File formats hold the raw data and are physically stored on the lake storage. Hudi operates on logical structures of File Groups and File Slices, which consist of Base File and Log Files. Base Files are compacted and optimized for reads and are augmented with Log Files for efficient append. Future updates aim to integrate diverse formats like unstructured data (e.g., images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. 
Hudi's layout scheme encodes all changes to a Log File as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
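As a rough illustration of the File Group / Base File / Log File layout discussed in this review thread, the snippet below groups the file names of a partition into file groups. The naming patterns are assumptions modeled on the Hudi storage-layout spec (base file: `<fileId>_<writeToken>_<instantTime>.parquet`, log file: `.<fileId>_<baseInstant>.log.<version>_<writeToken>`) and may differ across Hudi versions:

```python
import re

# Assumed file-naming patterns (see the tech-specs storage-layout link above);
# treat these as illustrative, not authoritative for every Hudi release.
BASE_FILE = re.compile(r"^(?P<file_id>[^_]+)_(?P<write_token>[^_]+)_(?P<instant>\d+)\.parquet$")
LOG_FILE = re.compile(r"^\.(?P<file_id>.+?)_(?P<base_instant>\d+)\.log\.(?P<version>\d+)_(?P<write_token>[\d-]+)$")

def group_files(names):
    """Group the file names of one partition into file groups keyed by file ID."""
    groups = {}
    for name in names:
        m, kind = BASE_FILE.match(name), "base"
        if not m:
            m, kind = LOG_FILE.match(name), "log"
        if not m:
            continue  # metadata files etc. are skipped
        groups.setdefault(m.group("file_id"), {"base": [], "log": []})[kind].append(name)
    return groups

names = [
    "fg-1_0-1-2_20240201.parquet",      # base file, first slice of file group fg-1
    "fg-1_0-3-4_20240205.parquet",      # base file, second slice of fg-1
    ".fg-1_20240205.log.1_1-0-1",       # log file appended to the second slice
    "fg-2_0-1-2_20240201.parquet",      # base file of a second file group
    ".hoodie_partition_metadata",       # not a data file, ignored
]
print(group_files(names))
```

Each key in the result corresponds to one file group; the two base files under `fg-1` represent two file slices, and the log file augments the newer slice, matching the structure the figure above depicts.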
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927870267 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN * 24839296069f8b228f31e7000c77a4630913dc07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318) * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22330) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
[ https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-7360: Priority: Blocker (was: Critical)

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
> Issue Type: Bug
> Components: incremental-query, reader-core
> Reporter: Aditya Goenka
> Priority: Blocker
> Fix For: 1.1.0
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
> Reproducible code:

```python
from typing import Any
from pyspark import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Hudi Basics") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

sc = spark.sparkContext
table_name = "hudi_trips_cdc"
base_path = "/tmp/test_issue_10590_4"  # Replace for whatever path
quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
dataGen = quickstart_utils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))

def create_df():
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    return df

def write_data():
    df = create_df()
    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # This can be either MoR or CoW and the error will still happen
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.table.cdc.enabled": "true",  # This can be left enabled, and won't affect anything unless actually queried as CDC
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2,
    }
    df.write.format("hudi") \
        .options(**hudi_options) \
        .mode("overwrite") \
        .save(base_path)

def update_data():
    updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.write \
        .format("hudi") \
        .mode("append") \
        .save(base_path)

def incremental_query():
    ordered_rows: list[Row] = spark.read \
        .format("hudi") \
        .load(base_path) \
        .select(col("_hoodie_commit_time").alias("commit_time")) \
        .orderBy(col("commit_time")) \
        .collect()
    commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
    begin_time = commits[0]
    incremental_read_options = {
        'hoodie.datasource.query.incremental.format': "cdc",  # Uncomment this line to query as CDC, crashes in 0.14.1
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': begin_time,
    }
    trips_incremental_df = spark.read \
        .format("hudi") \
        .options(**incremental_read_options) \
        .load(base_path)
    # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
    # sql_query = f"""SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
    # trips_incremental_df = spark.sql(sql_query)
    trips_incremental_df.show()
    trips_incremental_df.printSchema()

if __name__ == "__main__":
    write_data()
    update_data()
    incremental_query()
```

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7360) Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
[ https://issues.apache.org/jira/browse/HUDI-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-7360: Component/s: incremental-query

> Incremental CDC Query after 0.14.1 upgrade giving Jackson class incompatibility exception
> -
>
> Key: HUDI-7360
> URL: https://issues.apache.org/jira/browse/HUDI-7360
> Project: Apache Hudi
> Issue Type: Bug
> Components: incremental-query, reader-core
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 1.1.0
>
> Github Issue - [https://github.com/apache/hudi/issues/10590]
> Reproducible code:

```python
from typing import Any
from pyspark import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Hudi Basics") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

sc = spark.sparkContext
table_name = "hudi_trips_cdc"
base_path = "/tmp/test_issue_10590_4"  # Replace for whatever path
quickstart_utils = sc._jvm.org.apache.hudi.QuickstartUtils
dataGen = quickstart_utils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))

def create_df():
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    return df

def write_data():
    df = create_df()
    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # This can be either MoR or CoW and the error will still happen
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.table.cdc.enabled": "true",  # This can be left enabled, and won't affect anything unless actually queried as CDC
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.upsert.shuffle.parallelism": 2,
        "hoodie.insert.shuffle.parallelism": 2,
    }
    df.write.format("hudi") \
        .options(**hudi_options) \
        .mode("overwrite") \
        .save(base_path)

def update_data():
    updates = quickstart_utils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.write \
        .format("hudi") \
        .mode("append") \
        .save(base_path)

def incremental_query():
    ordered_rows: list[Row] = spark.read \
        .format("hudi") \
        .load(base_path) \
        .select(col("_hoodie_commit_time").alias("commit_time")) \
        .orderBy(col("commit_time")) \
        .collect()
    commits: list[Any] = list(map(lambda row: row[0], ordered_rows))
    begin_time = commits[0]
    incremental_read_options = {
        'hoodie.datasource.query.incremental.format': "cdc",  # Uncomment this line to query as CDC, crashes in 0.14.1
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': begin_time,
    }
    trips_incremental_df = spark.read \
        .format("hudi") \
        .options(**incremental_read_options) \
        .load(base_path)
    # Error also occurs when using the "from_hudi_table_changes" in 0.14.1
    # sql_query = f"""SELECT * FROM hudi_table_changes ('{base_path}', 'cdc', 'earliest')"""
    # trips_incremental_df = spark.sql(sql_query)
    trips_incremental_df.show()
    trips_incremental_df.printSchema()

if __name__ == "__main__":
    write_data()
    update_data()
    incremental_query()
```

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7375] Disable a test method failure caused by MiniHdfs [hudi]
hudi-bot commented on PR #10627: URL: https://github.com/apache/hudi/pull/10627#issuecomment-1927859226 ## CI report: * f86247ccd72b443975c8ab08b74300627641c5c8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22331) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927859155 ## CI report: * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927858601 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN * 24839296069f8b228f31e7000c77a4630913dc07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318) * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN * a35d70e6bd5a2a1fe5fcbf032e536b98fbb197ae UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7375] Disable a flaky test case [hudi]
hudi-bot commented on PR #10627: URL: https://github.com/apache/hudi/pull/10627#issuecomment-1927736445 ## CI report: * f86247ccd72b443975c8ab08b74300627641c5c8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927736186 ## CI report: * e69065c1325a38735b053108f72341db0cd31da9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324) * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22329) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927701138 ## CI report: * e69065c1325a38735b053108f72341db0cd31da9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324) * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) * 2d0152f41f27aa4acb9c47bbf9061f7726e49fa7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1927700598 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN * a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN * b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN * d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN * e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN * f8c748241017499433296ff26e6984064d8085b8 UNKNOWN * 24839296069f8b228f31e7000c77a4630913dc07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318) * 8b190fe2b8bf40f56e4033aaeb3889fe4f03b75a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7375) Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks
[ https://issues.apache.org/jira/browse/HUDI-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7375: Labels: pull-request-available (was: )

> Fix flaky test: testLogReaderWithDifferentVersionsOfDeleteBlocks
>
> Key: HUDI-7375
> URL: https://issues.apache.org/jira/browse/HUDI-7375
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Major
> Labels: pull-request-available
>
{code:java}
Error: testLogReaderWithDifferentVersionsOfDeleteBlocks{DiskMapType, boolean, boolean, boolean}[13] Time elapsed: 0.043 s <<< ERROR!
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/root/[13] BITCASK, false, true, false1706913234251/partition_path/.test-fileid1_100.log.1_1-0-1 could only be written to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2338)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2989)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:911)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
    at org.apache.hadoop.ipc.Client.call(Client.java:1558)
    at org.apache.hadoop.ipc.Client.call(Client.java:1455)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at jdk.proxy2/jdk.proxy2.$Proxy43.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:530)
    at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at jdk.proxy2/jdk.proxy2.$Proxy44.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1915)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1717)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713)
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7375] Disable a flaky test method [hudi]
linliu-code opened a new pull request, #10627: URL: https://github.com/apache/hudi/pull/10627 ### Change Logs The test flakiness is caused by issues in the underlying MiniHDFS. We should aim to fix the root cause; disable the method for now. ### Impact Unblocks CI tests. ### Risk level (write none, low medium or high below) Low. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927681286 ## CI report: * e69065c1325a38735b053108f72341db0cd31da9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324) * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625: URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927531831 ## CI report: * 264059fcce703e1bde6c07bdce6ee106fcff30a6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22325) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]
yihua commented on code in PR #10619: URL: https://github.com/apache/hudi/pull/10619#discussion_r1478603403 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java: ## @@ -239,18 +240,25 @@ public List compact(HoodieCompactionHandler compactionHandler, scanner.close(); Iterable> resultIterable = () -> result; return StreamSupport.stream(resultIterable.spliterator(), false).flatMap(Collection::stream).peek(s -> { - s.getStat().setTotalUpdatedRecordsCompacted(scanner.getNumMergedRecordsInLog()); - s.getStat().setTotalLogFilesCompacted(scanner.getTotalLogFiles()); - s.getStat().setTotalLogRecords(scanner.getTotalLogRecords()); - s.getStat().setPartitionPath(operation.getPartitionPath()); - s.getStat() + final HoodieWriteStat stat = s.getStat(); + stat.setTotalUpdatedRecordsCompacted(scanner.getNumMergedRecordsInLog()); + stat.setTotalLogFilesCompacted(scanner.getTotalLogFiles()); + stat.setTotalLogRecords(scanner.getTotalLogRecords()); + stat.setPartitionPath(operation.getPartitionPath()); + stat .setTotalLogSizeCompacted(operation.getMetrics().get(CompactionStrategy.TOTAL_LOG_FILE_SIZE).longValue()); - s.getStat().setTotalLogBlocks(scanner.getTotalLogBlocks()); - s.getStat().setTotalCorruptLogBlock(scanner.getTotalCorruptBlocks()); - s.getStat().setTotalRollbackBlocks(scanner.getTotalRollbacks()); + stat.setTotalLogBlocks(scanner.getTotalLogBlocks()); + stat.setTotalCorruptLogBlock(scanner.getTotalCorruptBlocks()); + stat.setTotalRollbackBlocks(scanner.getTotalRollbacks()); RuntimeStats runtimeStats = new RuntimeStats(); + // scan time has to be obtained from scanner. runtimeStats.setTotalScanTime(scanner.getTotalTimeTakenToReadAndMergeBlocks()); - s.getStat().setRuntimeStats(runtimeStats); + // create and upsert time are obtained from the create or merge handle. 
+ if (stat.getRuntimeStats() != null) { + runtimeStats.setTotalCreateTime(stat.getRuntimeStats().getTotalCreateTime()); + runtimeStats.setTotalUpsertTime(stat.getRuntimeStats().getTotalUpsertTime()); Review Comment: Can we add a unit test around the runtime stats? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927504960 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323) * e69065c1325a38735b053108f72341db0cd31da9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324) * 36b0460dcb4c7ecc69d79d92befaab358a068d4e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22326) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[I] Inconsistency in Hudi Table Configuration between Initial Insert and Subsequent Merges [hudi]
prashant462 opened a new issue, #10626: URL: https://github.com/apache/hudi/issues/10626 ### Issue Summary When using dbt Spark with Hudi to create a Hudi format table, there is an inconsistency in the Hudi table configuration between the initial insert and subsequent merge operations. The properties provided in the options of the dbt model are correctly fetched and applied during the first run. However, during the second run, when executing the merge operation, Hudi fetches a subset of the properties from the Hudi catalog table, leading to the addition of default properties and changes in configuration. ### Steps to Reproduce
- Execute the dbt model with Hudi options for the initial insert. Sample model:
```
{{ config(
    materialized = 'incremental',
    file_format = 'hudi',
    pre_hook = "SET spark.sql.legacy.allowNonEmptyLocationInCTAS = true",
    location_root = "file:///Users/B0279627/Downloads/Hudi",
    unique_key = "id",
    incremental_strategy = "merge",
    options = {
        'preCombineField': 'id2',
        'hoodie.index.type': "GLOBAL_SIMPLE",
        'hoodie.simple.index.update.partition.path': 'true',
        'hoodie.keep.min.commits': '145',
        'hoodie.keep.max.commits': '288',
        'hoodie.cleaner.policy': 'KEEP_LATEST_BY_HOURS',
        'hoodie.cleaner.hours.retained': '72',
        'hoodie.cleaner.fileversions.retained': '144',
        'hoodie.cleaner.commits.retained': '144',
        'hoodie.upsert.shuffle.parallelism': '200',
        'hoodie.insert.shuffle.parallelism': '200',
        'hoodie.bulkinsert.shuffle.parallelism': '200',
        'hoodie.delete.shuffle.parallelism': '200',
        'hoodie.parquet.compression.codec': 'zstd',
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
        'hoodie.datasource.write.reconcile.schema': 'true',
        'hoodie.enable.data.skipping': 'true',
        'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    }
) }}
```
- Observe that all specified properties are correctly applied during the first run.
- To observe this, check a sample property such as hoodie.index.type=GLOBAL_SIMPLE.
- Execute the dbt model with Hudi options for a subsequent merge operation.
- Observe changes in Hudi table properties, with defaults being applied for certain configurations; for example, hoodie.index.type changes to SIMPLE (the target table ends up with hoodie.index.type=SIMPLE).
### Expected Behavior Hudi should consistently apply all specified properties in every run, irrespective of whether it is the initial insert or a subsequent merge operation. The properties passed in the options of the dbt model should be retained and applied consistently across all operations. ### Environment Description * Hudi version : 0.12.1 * Spark version : 3.3.1 * Hive version : 3.1.3 * Hadoop version : 3.1.1 * DBT version: 1.7.1 * Storage (HDFS/S3/GCS..) : Checked with S3, HDFS, and the local file system. * Running on Docker? (yes/no) : no ### **Additional context** In the second run, MergeIntoHoodieTableCommand.scala executes InsertIntoHoodieTableCommand.run(); in this case Hudi fetches the props from the Hudi catalog table, where it picks up the table configs and catalog properties. These are not the complete set of properties passed in the first run via dbt options, so Hudi adds other default properties that were not fetched from the catalog props. This appears to be why many properties change. Below are some images of the properties fetched in subsequent merge operations: https://github.com/apache/hudi/assets/31952894/46126281-b95a-47a4-9116-66a093a97506 https://github.com/apache/hudi/assets/31952894/80ba4206-77d0-4852-aaf1-fd0e19c91025
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625: URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927384615 ## CI report: * 264059fcce703e1bde6c07bdce6ee106fcff30a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22325)
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927384527 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323) * e69065c1325a38735b053108f72341db0cd31da9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324) * 36b0460dcb4c7ecc69d79d92befaab358a068d4e UNKNOWN
Re: [PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
hudi-bot commented on PR #10625: URL: https://github.com/apache/hudi/pull/10625#issuecomment-1927368985 ## CI report: * 264059fcce703e1bde6c07bdce6ee106fcff30a6 UNKNOWN
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927353626 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323) * e69065c1325a38735b053108f72341db0cd31da9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22324)
[jira] [Updated] (HUDI-7384) Implement writer path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7384: - Labels: pull-request-available (was: ) > Implement writer path support for secondary index > - > > Key: HUDI-7384 > URL: https://issues.apache.org/jira/browse/HUDI-7384 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > > # Basic initialization on an existing table > # Handle inserts/upserts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7384][secondary-index][write] build secondary index on the keys… [hudi]
bhat-vinay opened a new pull request, #10625: URL: https://github.com/apache/hudi/pull/10625 … defined through options Initial commit. Supports the following features:
1. Modify schema to add secondary index to metadata
2. New partition type in the metadata table to store secondary_keys-to-record_keys mapping
3. Various options to support secondary index enablement, column mappings (for secondary keys), etc.
4. Initialization of secondary keys
5. Update secondary keys on inserts/upserts
Supports only one secondary index at the moment. The PR is still a WIP and needs more work to handle deletions, proper merging, compaction, and (re)clustering, among other things. ### Change Logs Initial commit. Supports the following features:
1. Modify schema to add secondary index to metadata
2. New partition type in the metadata table to store secondary_keys-to-record_keys mapping
3. Various options to support secondary index enablement, column mappings (for secondary keys), etc.
4. Initialization of secondary keys
5. Update secondary keys on inserts/upserts
Supports only one secondary index at the moment. The PR is still a WIP and needs more work to handle deletions, proper merging, compaction, and (re)clustering, among other things. ### Impact Support secondary index on columns (similar to record index, but for non-unique columns) ### Risk level (write none, low medium or high below) Medium. New and existing tests ### Documentation Update NA. Will be done later ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
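The secondary_keys-to-record_keys mapping in point 2 is a non-unique index: unlike the record index, one secondary-key value may map to many record keys. A minimal sketch of that shape (illustrative Python, not Hudi's actual metadata-table layout; names are hypothetical):

```python
from collections import defaultdict

def build_secondary_index(records, secondary_field):
    """Map each secondary-key value to the set of record keys holding it.
    `records` is {record_key: {field: value}} — a toy stand-in for a
    Hudi file group's records."""
    index = defaultdict(set)
    for record_key, fields in records.items():
        # Non-unique: multiple record keys can share one secondary key.
        index[fields[secondary_field]].add(record_key)
    return dict(index)
```

On upsert, such a mapping has to be maintained incrementally (remove the record key from its old secondary-key entry, add it to the new one), which is part of what the deletion/merging follow-up work covers.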
[jira] [Created] (HUDI-7384) Implement writer path support for secondary index
Vinaykumar Bhat created HUDI-7384: - Summary: Implement writer path support for secondary index Key: HUDI-7384 URL: https://issues.apache.org/jira/browse/HUDI-7384 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinaykumar Bhat # Basic initialization on an existing table # Handle inserts/upserts
[jira] [Assigned] (HUDI-7384) Implement writer path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7384: - Assignee: Vinaykumar Bhat > Implement writer path support for secondary index > - > > Key: HUDI-7384 > URL: https://issues.apache.org/jira/browse/HUDI-7384 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > # Basic initialization on an existing table > # Handle inserts/upserts
[jira] [Created] (HUDI-7383) CDC query failed due to dependency issue
Raymond Xu created HUDI-7383: Summary: CDC query failed due to dependency issue Key: HUDI-7383 URL: https://issues.apache.org/jira/browse/HUDI-7383 Project: Apache Hudi Issue Type: Bug Components: incremental-query Affects Versions: 0.14.1, 0.14.0 Reporter: Raymond Xu

{code:java}
spark-sql (default)> select count(*) from hudi_table_changes('tbl', 'cdc', '20240205084624923', '20240205091637412');
24/02/05 09:47:46 WARN TaskSetManager: Lost task 10.0 in stage 28.0 (TID 1515) (ip-10-0-117-21.us-west-2.compute.internal executor 3): java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$
  at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.<init>(HoodieCDCRDD.scala:237)
  at org.apache.hudi.cdc.HoodieCDCRDD.compute(HoodieCDCRDD.scala:101)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
  at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
  at org.apache.spark.scheduler.Task.run(Task.scala:141)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:563)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:566)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.com.fasterxml.jackson.module.scala.DefaultScalaModule$
  at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
  ... 21 more
{code}
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927227972 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323) * e69065c1325a38735b053108f72341db0cd31da9 UNKNOWN
[PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]
dipankarmazumdar opened a new pull request, #10624: URL: https://github.com/apache/hudi/pull/10624 ### Change Logs This PR adds a new page to the Hudi documentation called 'Apache Hudi Stack' ### Impact Adds a new page for clarity around Hudi's platform & architecture ### Risk level (write none, low medium or high below) None ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ Update is for documentation ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - CI passed
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927134709 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22323)
Re: [PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
hudi-bot commented on PR #10623: URL: https://github.com/apache/hudi/pull/10623#issuecomment-1927118936 ## CI report: * f79a88bd2e0b6bff0c2657a6632ea8884e9af866 UNKNOWN
Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
hudi-bot commented on PR #10621: URL: https://github.com/apache/hudi/pull/10621#issuecomment-1927104255 ## CI report: * 03e73542a48058577ff24fa42a6aebc1d4e2991e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22322)
Re: [I] [SUPPORT] Flink Table planner not loading problem [hudi]
vkhoroshko commented on issue #8265: URL: https://github.com/apache/hudi/issues/8265#issuecomment-1927062770 Hello, is there any solution for this? I'm running the Flink SQL client locally and it has flink-table-planner-loader-1.17.1.jar in the /opt/flink/lib folder (I'm using Docker). However, if async clustering is enabled, I receive the same error as above: ```java.lang.ClassNotFoundException: org.apache.flink.table.planner.codegen.sort.SortCodeGenerator```
[PR] [HUDI-7351] Handle case when glue expression larger than 2048 limit [hudi]
parisni opened a new pull request, #10623: URL: https://github.com/apache/hudi/pull/10623 ### Change Logs After a few days in production, it turns out Glue has a hard limit on expression length (2048 chars). This patch handles that case by falling back to returning all existing partitions. ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
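The fallback described in the change logs can be sketched like this (illustrative Python; `fetch_filtered` and `fetch_all` are hypothetical callables standing in for the Glue sync client's partition-listing calls, not Hudi's actual API):

```python
GLUE_EXPRESSION_MAX_LENGTH = 2048  # AWS Glue's hard limit on filter expressions

def list_partitions(expression, fetch_filtered, fetch_all):
    """Push the filter expression down to Glue when it fits within the
    limit; otherwise fall back to listing all existing partitions."""
    if expression and len(expression) > GLUE_EXPRESSION_MAX_LENGTH:
        # Expression too long for Glue: return everything and let the
        # caller filter client-side.
        return fetch_all()
    return fetch_filtered(expression)
```

The trade-off is correctness over efficiency: an over-long expression degrades to a full partition listing rather than a Glue-side error.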
Re: [I] The Schema Evolution Not working For Hudi 0.12.3 [hudi]
Amar1404 commented on issue #10309: URL: https://github.com/apache/hudi/issues/10309#issuecomment-1926993882 Hi @ad1happy2go - in my case the column type is long and was changed to double.
[I] Apache Hudi Auto-Size During Writes is not Working for Flink SQL [hudi]
vkhoroshko opened a new issue, #10622: URL: https://github.com/apache/hudi/issues/10622 **To Reproduce** Steps to reproduce the behavior: 1. Use Flink SQL with the file below. **Current behavior** A separate parquet file is produced with every Flink commit (during checkpointing). **Expected behavior** Data is appended to existing parquet file(s) until the max size threshold is met. **Environment Description** * Hudi version : 0.14.1 * Flink version : 1.17.1 * Storage (HDFS/S3/GCS..) : File System * Running on Docker? (yes/no) : yes **Additional context** The expectation (as depicted in the Apache Hudi docs - https://hudi.apache.org/docs/file_sizing#auto-sizing-during-writes) is that with every Flink commit (every minute), a set of records will be accumulated and written to one of the existing parquet files until the parquet file max size threshold is met (5MB in the example below). However, what happens is that every commit results in a separate parquet file (~400KB in size), and these files accumulate and are never merged. Please help. SQL file:
```
SET 'parallelism.default' = '1';
SET 'execution.checkpointing.interval' = '1m';

CREATE TABLE datagen (
    id INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
    'connector' = 'datagen',
    'rows-per-second' = '5'
);

CREATE TABLE hudi_tbl (
    id INT NOT NULL PRIMARY KEY NOT ENFORCED,
    data STRING
) WITH (
    'connector' = 'hudi',
    'path' = 'file:///opt/hudi',
    'table.type' = 'COPY_ON_WRITE',
    'write.parquet.block.size' = '1',
    'write.operation' = 'insert',
    'write.parquet.max.file.size' = '5'
);

INSERT INTO hudi_tbl SELECT * from datagen;
```
Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]
VitoMakarevich commented on issue #10590: URL: https://github.com/apache/hudi/issues/10590#issuecomment-1926893180 The same happens with streaming source - since `HoodieSourceOffset` has `import com.fasterxml.jackson.module.scala.DefaultScalaModule`. As for 0.14.1 bundle - it has only `com.fasterxml.jackson.module.afterburner` from jackson.
Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
hudi-bot commented on PR #10621: URL: https://github.com/apache/hudi/pull/10621#issuecomment-1926762566 ## CI report: * 03e73542a48058577ff24fa42a6aebc1d4e2991e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22322)
Re: [PR] [WIP] [HUDI-5823] [RFC-65] Update to the Partition TTL RFC [hudi]
geserdugarov closed pull request #10248: [WIP] [HUDI-5823] [RFC-65] Update to the Partition TTL RFC URL: https://github.com/apache/hudi/pull/10248
Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
hudi-bot commented on PR #10621: URL: https://github.com/apache/hudi/pull/10621#issuecomment-1926750611 ## CI report: * 03e73542a48058577ff24fa42a6aebc1d4e2991e UNKNOWN
Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]
hudi-bot commented on PR #10618: URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926738703 ## CI report: * 5576069d77bdb9202c83627c2f0b93a9ae7ed208 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22320)
Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]
hudi-bot commented on PR #10619: URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926738761 ## CI report: * 9945ee19750336801b3b816710234deabfce3b63 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22321)
Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]
PrabhuJoseph commented on PR #10618: URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926722425 @danny0405 Could you review this patch when you get time? Thanks.
[PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
fhan688 opened a new pull request, #10621: URL: https://github.com/apache/hudi/pull/10621 ### Change Logs Get partitions from active timeline instead of listing when building clustering plan ### Impact New strategy to build clustering plan for Flink ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
fhan688 closed pull request #10620: [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan URL: https://github.com/apache/hudi/pull/10620
[jira] [Updated] (HUDI-7382) Get partitions from active timeline instead of listing when building clustering plan
[ https://issues.apache.org/jira/browse/HUDI-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7382: - Labels: pull-request-available (was: ) > Get partitions from active timeline instead of listing when building > clustering plan > > > Key: HUDI-7382 > URL: https://issues.apache.org/jira/browse/HUDI-7382 > Project: Apache Hudi > Issue Type: New Feature >Reporter: fhan >Priority: Major > Labels: pull-request-available >
[PR] [HUDI-7382] Get partitions from active timeline instead of listing when building clustering plan [hudi]
fhan688 opened a new pull request, #10620: URL: https://github.com/apache/hudi/pull/10620

### Change Logs

Get partitions from the active timeline instead of listing the file system when building the clustering plan.

### Impact

New strategy to build the clustering plan for Flink.

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-7382) Get partitions from active timeline instead of listing when building clustering plan
fhan created HUDI-7382:
    Summary: Get partitions from active timeline instead of listing when building clustering plan
    Key: HUDI-7382
    URL: https://issues.apache.org/jira/browse/HUDI-7382
    Project: Apache Hudi
    Issue Type: New Feature
    Reporter: fhan
Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]
maheshguptags commented on issue #10609: URL: https://github.com/apache/hudi/issues/10609#issuecomment-1926535894

@ad1happy2go I tried without RLI and it works fine. However, when I add the RLI index to the table, it starts failing. I am not sure why RLI causes errors while the table works fine without any index.
Re: [PR] [HUDI-7146] Handle duplicate keys in HFile [hudi]
hudi-bot commented on PR #10617: URL: https://github.com/apache/hudi/pull/10617#issuecomment-1926524438

## CI report:

* 8f4c9339886d7a863faa59c30bb8047df9b89ad3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22319)

Bot commands @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6902] Containerize the Azure tests [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1926523966

## CI report:

* 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
* 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
* a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
* a5529adc60d4af0c3ece9bbcdcc98ecd5482d21a UNKNOWN
* b13310f2241a287a1966fe7fd63a616b86c3974c UNKNOWN
* d47977a291de7374cc34436f4c4e22e1812a883e UNKNOWN
* e0931770db4a4846a16b09eace9154166bd0842d UNKNOWN
* f8c748241017499433296ff26e6984064d8085b8 UNKNOWN
* 24839296069f8b228f31e7000c77a4630913dc07 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22318)
Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]
hudi-bot commented on PR #10619: URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926450038

## CI report:

* 9945ee19750336801b3b816710234deabfce3b63 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22321)
Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]
hudi-bot commented on PR #10618: URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926449973

## CI report:

* 5576069d77bdb9202c83627c2f0b93a9ae7ed208 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22320)
Re: [PR] [HUDI-7381] Fix compaction write stats and metrics for create and upsert time [hudi]
hudi-bot commented on PR #10619: URL: https://github.com/apache/hudi/pull/10619#issuecomment-1926439711

## CI report:

* 9945ee19750336801b3b816710234deabfce3b63 UNKNOWN
Re: [PR] [HUDI-7379] Exclude jackson-module-afterburner from hudi-aws module [hudi]
hudi-bot commented on PR #10618: URL: https://github.com/apache/hudi/pull/10618#issuecomment-1926439656

## CI report:

* 5576069d77bdb9202c83627c2f0b93a9ae7ed208 UNKNOWN