Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993682626

## CI report:

* 7bce9399d616a570e8a04c783b06e7e2f404dc5a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22886)

## Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7492] fix the issue of incorrect keygenerator specification when creating m… [hudi]
hudi-bot commented on PR #10840: URL: https://github.com/apache/hudi/pull/10840#issuecomment-1993682540

## CI report:

* cf41aa0ce79b39dc6f09db500db4b123fed34ff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22877)
* c6d233de457320d91579376f3d4669ee4dcf8f50 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22887)
Re: [PR] [HUDI-7492] fix the issue of incorrect keygenerator specification when creating m… [hudi]
hudi-bot commented on PR #10840: URL: https://github.com/apache/hudi/pull/10840#issuecomment-1993675304

## CI report:

* cf41aa0ce79b39dc6f09db500db4b123fed34ff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22877)
* c6d233de457320d91579376f3d4669ee4dcf8f50 UNKNOWN
Re: [I] [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 [hudi]
codope closed issue #9119: [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 URL: https://github.com/apache/hudi/issues/9119
Re: [I] [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 [hudi]
ad1happy2go commented on issue #9119: URL: https://github.com/apache/hudi/issues/9119#issuecomment-1993660645

Closing this, as it was fixed via https://github.com/apache/hudi/pull/9984.
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993620652

## CI report:

* 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
* 7bce9399d616a570e8a04c783b06e7e2f404dc5a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22886)
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993614748

## CI report:

* 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
* 7bce9399d616a570e8a04c783b06e7e2f404dc5a UNKNOWN
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993528107

> Old key name `hoodie.clean.automatic` in `TestCDCDataFrameSuite.testCOWDataSourceWrite` doesn't work. Search for the reason.

The reason is that `HoodieCDCTestBase` adds `HoodieCleanConfig.AUTO_CLEAN.key -> "false"` to `commonOpts`. If we then add `.option("hoodie.clean.automatic", "true")` in `TestCDCDataFrameSuite.testCOWDataSourceWrite` on top of the same `commonOpts`, we end up with two different values for a single parameter.
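For illustration only (this is not the Hudi test code, and `OptionOverwriteDemo` is a hypothetical name): when the same configuration key is written twice into an options map, the later value silently replaces the earlier one, which is why layering a per-test `.option(...)` over a base `commonOpts` entry for the same parameter is fragile:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOverwriteDemo {
    public static void main(String[] args) {
        // Base options shared by all tests (mirrors commonOpts in HoodieCDCTestBase).
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.clean.automatic", "false");

        // A single test sets the same key again (mirrors .option("hoodie.clean.automatic", "true")).
        opts.put("hoodie.clean.automatic", "true");

        // The later put silently wins; the base value is gone.
        System.out.println(opts.get("hoodie.clean.automatic")); // prints "true"
        System.out.println(opts.size());                        // prints 1
    }
}
```

Which value "wins" depends entirely on insertion order, so the safer fix is to use a single, consistently named key in one place.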
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993363802

## CI report:

* c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
* 948d9ecb6dc661628b787ba800756b78d52791af Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993226243

## CI report:

* c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
* 998a987f33866407fd1b2d8350e6c2f2386f59ad Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22882)
* 948d9ecb6dc661628b787ba800756b78d52791af Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
hudi-bot commented on PR #10845: URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993208787

## CI report:

* c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
* 998a987f33866407fd1b2d8350e6c2f2386f59ad Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22882)
* 948d9ecb6dc661628b787ba800756b78d52791af UNKNOWN
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r152204

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:

@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * A global timeline view with both active and archived timeline involved.
+ */
+public class HoodieGlobalTimeline extends HoodieDefaultTimeline {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class);
+  private final HoodieTableMetaClient metaClient;
+  private final HoodieActiveTimeline activeTimeline;
+  private final HoodieArchivedTimeline archivedTimeline;
+
+  protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option startInstant) {
+    this.metaClient = metaClient;
+    this.activeTimeline = new HoodieActiveTimeline(metaClient);
+    archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient);
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));

Review Comment:
I made the archived timeline loading smart based on the given `startTs`. If we also want lazy loading at the level of specific API invocations, that would need a refactoring of all the timelines to support lazy setup of the instants. I would prefer to do that in another PR if we reach consensus.
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1522287754 ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ValidationUtils; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; + +import java.util.List; +import java.util.Objects; +import java.util.function.Function; + +/** + * A global timeline view with both active and archived timeline involved. 
+ */ +public class HoodieGlobalTimeline extends HoodieDefaultTimeline { + private static final long serialVersionUID = 1L; + private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class); + private final HoodieTableMetaClient metaClient; + private final HoodieActiveTimeline activeTimeline; + private final HoodieArchivedTimeline archivedTimeline; + + protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option startInstant) { +this.metaClient = metaClient; +this.activeTimeline = new HoodieActiveTimeline(metaClient); +archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient); +this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline); +setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants())); + } + + protected HoodieGlobalTimeline(HoodieActiveTimeline activeTimeline, HoodieArchivedTimeline archivedTimeline) { +this.metaClient = activeTimeline.metaClient; +this.activeTimeline = activeTimeline; +this.archivedTimeline = archivedTimeline; +this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline); +setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants())); + } + + /** + * For serialization and de-serialization only. 
+ */ + public HoodieGlobalTimeline() { +this.activeTimeline = null; +this.archivedTimeline = null; +this.metaClient = null; + } + + @Override + public HoodieTimeline filterPendingCompactionTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingCompactionTimeline(); + } + + @Override + public HoodieTimeline filterPendingLogCompactionTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingLogCompactionTimeline(); + } + + @Override + public HoodieTimeline filterPendingMajorOrMinorCompactionTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingMajorOrMinorCompactionTimeline(); + } + + @Override + public HoodieTimeline filterPendingReplaceTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingReplaceTimeline(); + } + + @Override + public HoodieTimeline filterPendingRollbackTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingRollbackTimeline(); + } + + @Override + public HoodieTimeline filterRequestedRollbackTimeline() { +// override for efficiency +return this.activeTimeline.filterRequestedRollbackTimeline(); + } + + @Override + public HoodieTimeline filterPendingIndexTimeline() { +// override for efficiency +return this.activeTimeline.filterPendingIndexTimeline(); + } + + @Override + public boolean empty() { +return this.activeTimeline.empty(); + } + + /** + * Returns whether the active timeline contains the given instant or the instant is archived. + * Needs to rethink the new semantics and rename this method with global timeline introduced. + */ + @Override + public boolean containsOrBeforeTimelineStarts(String ts) { +return this.activeTimeline.containsOrBeforeTimelineStarts(ts); + } + + @Override + public boolean isBeforeTimelineStarts(String ts) { +return this.activeTimeline.isBeforeTimelineStarts(ts); + } + + @Override + public Option getF
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1522286800

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:

@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * A global timeline view with both active and archived timeline involved.
+ */
+public class HoodieGlobalTimeline extends HoodieDefaultTimeline {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class);
+  private final HoodieTableMetaClient metaClient;
+  private final HoodieActiveTimeline activeTimeline;
+  private final HoodieArchivedTimeline archivedTimeline;
+
+  protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option startInstant) {
+    this.metaClient = metaClient;
+    this.activeTimeline = new HoodieActiveTimeline(metaClient);
+    archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient);
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));

Review Comment:
Yeah, that could be smarter. In this PR, when the `startTs` is empty, the archived timeline loading actually does nothing, but I think we can even make the loading lazy when the `startTs` instant itself is active.
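As a hedged illustration of the lazy-loading idea discussed above (this is not the Hudi code; `LazyLoadDemo` and `memoize` are hypothetical names): wrapping an expensive load in a memoized `Supplier` defers the work until the first access and runs it at most once, which is the behavior one would want when the archived timeline is only sometimes needed:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class LazyLoadDemo {
    /** Wraps a Supplier so the underlying load runs at most once, on first get(). */
    static <T> Supplier<T> memoize(Supplier<T> loader) {
        return new Supplier<T>() {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    value = loader.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for loading archived instants from storage.
        Supplier<List<String>> archived = memoize(() -> {
            System.out.println("loading archived timeline...");
            return Arrays.asList("0001", "0002");
        });

        // Nothing is loaded yet; the load only runs when instants are first requested.
        System.out.println(archived.get()); // triggers the one-time load
        System.out.println(archived.get()); // served from the cached value, no reload
    }
}
```

A library equivalent of this pattern is Guava's `Suppliers.memoize`; the sketch only shows why per-invocation laziness would need the timeline's instant list to be set up behind such a deferred accessor.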
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1522285239

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:

@@ -167,12 +167,13 @@ public HoodieDefaultTimeline getWriteTimeline() {
 @Override
 public HoodieTimeline getContiguousCompletedWriteTimeline() {
-Option earliestPending = getWriteTimeline().filterInflightsAndRequested().firstInstant();
+HoodieDefaultTimeline writeTimeline = getWriteTimeline();

Review Comment:
Just a small optimization by hand; not relevant to this PR.
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
danny0405 commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1522284552

## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java:

@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.DummyActiveAction;
+import org.apache.hudi.client.timeline.LSMTimelineWriter;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.engine.LocalTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.testutils.HoodieTestTable;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test cases for {@link HoodieGlobalTimeline}.
+ */
+public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness {
+  @BeforeEach
+  public void setUp() throws Exception {
+    initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+    cleanMetaClient();
+  }
+
+  /**
+   * The test for checking whether an instant is archived.
+   */
+  @Test
+  void testArchivingCheck() throws Exception {
+    writeArchivedTimeline(10, 1000, 50);
+    writeActiveTimeline(1050, 10);
+    HoodieGlobalTimeline globalTimeline = new HoodieGlobalTimeline(this.metaClient, Option.empty());
+    assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant should be active");

Review Comment:
Agree. I kind of think `isArchived` is the proper naming, WDYT?
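The check being exercised above, whether a timestamp falls before the start of the active timeline and hence can only be archived, can be sketched generically (hypothetical names, not the Hudi implementation; it assumes instant timestamps compare lexicographically in chronological order, as Hudi's fixed-width timestamps do):

```java
import java.util.Arrays;
import java.util.List;

public class ArchivedCheckDemo {
    /** True if ts sorts before the earliest active instant, i.e. it could only be archived. */
    static boolean isBeforeTimelineStarts(List<String> activeInstants, String ts) {
        return !activeInstants.isEmpty() && ts.compareTo(activeInstants.get(0)) < 0;
    }

    public static void main(String[] args) {
        // Active timeline starting at 1050 (mirrors writeActiveTimeline(1050, 10) in the test).
        List<String> active = Arrays.asList("1050", "1051", "1052");
        System.out.println(isBeforeTimelineStarts(active, "1049")); // true: before the active start, so archived
        System.out.println(isBeforeTimelineStarts(active, "1051")); // false: within the active window
    }
}
```

Naming this predicate `isArchived`, as suggested in the review, would state the intent directly instead of describing the comparison.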
Re: [PR] DOCS-updated videos [hudi]
nfarah86 commented on PR #10855: URL: https://github.com/apache/hudi/pull/10855#issuecomment-1992652425

https://github.com/apache/hudi/assets/5392555/264db195-dfe0-4712-b12f-d60122feb8dc

@bhasudha @xushiyan pr for videos
[PR] updated videos [hudi]
nfarah86 opened a new pull request, #10855: URL: https://github.com/apache/hudi/pull/10855

### Change Logs

updated videos

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

none

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]
Tyler-Rendina commented on issue #10590: URL: https://github.com/apache/hudi/issues/10590#issuecomment-1992510921

While I can kick off backfills, they eventually fail, alongside streams, with `java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;`
Re: [PR] [DO NOT MERGE][DOCS] Add more users in the powered-by page [hudi]
bhasudha commented on PR #10854: URL: https://github.com/apache/hudi/pull/10854#issuecomment-1992331710

Tested locally. Screenshots here!

![Screenshot 2024-03-12 at 11 49 27 AM](https://github.com/apache/hudi/assets/2179254/d0d65d05-d081-44dc-854f-cafed6126cfe)
![Screenshot 2024-03-12 at 11 49 42 AM](https://github.com/apache/hudi/assets/2179254/de49d177-33d6-497f-9c02-425db1e274c3)
[PR] [DO NOT MERGE][DOCS] Add more users in the powered-by page [hudi]
bhasudha opened a new pull request, #10854: URL: https://github.com/apache/hudi/pull/10854

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]
Tyler-Rendina commented on issue #10590: URL: https://github.com/apache/hudi/issues/10590#issuecomment-1992114835

Final note, apologies for the number of posts, but this may help EMR users with Glue as their Hive service.

Make sure to build Hudi using Java 8. If you are on ARM, use something like Azul OpenJDK and export $JAVA_HOME as the provided path, i.e., /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home. Once you upload your jars to S3, bootstrap such as:

```
sudo chown -R $USER:root /usr/lib/hudi
sudo chmod -R ugo+rw /usr/lib/hudi
aws s3 cp s3://BUCKET/jars/hudi-aws-bundle-0.14.1.jar /usr/lib/hudi
aws s3 cp s3://BUCKET/jars/hudi-spark3.3-bundle_2.12-0.14.1.jar /usr/lib/hudi
sudo ln -sf /usr/lib/hudi/hudi-aws-bundle-0.14.1.jar hudi-aws-bundle.jar
sudo ln -sf /usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar hudi-spark3.3-bundle.jar
```

To use your custom-built Hudi package, conform to your bootstrap paths in the following spark-submit command elements:

```
"--jars", "/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar,/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar,",
"--conf", "spark.driver.extraClassPath=/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar:/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar",
"--conf", "spark.executor.extraClassPath=/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar:/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar",
```

Finally (the issue I had with Class Not Found com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory): **DO NOT** use .enableHiveSupport() when setting your Spark context. While this works with Hudi imported via --packages, it will try to use the wrong Hive package when you specify --jars.
(hudi) branch asf-site updated: [DOCS][MINOR] Update powered-by page (#10853)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 03ea1e1b180 [DOCS][MINOR] Update powered-by page (#10853) 03ea1e1b180 is described below commit 03ea1e1b18086bc7f2e877a188d8a33db965cc40 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Tue Mar 12 07:25:20 2024 -0700 [DOCS][MINOR] Update powered-by page (#10853) --- website/src/pages/powered-by.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/website/src/pages/powered-by.md b/website/src/pages/powered-by.md index 6b5f57cc486..f26fe68741f 100644 --- a/website/src/pages/powered-by.md +++ b/website/src/pages/powered-by.md @@ -11,8 +11,6 @@ of people and organizations from all around the globe. The following is a small contributed to the Apache Hudi community! [Join us on slack](https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg), or come to one of our [virtual community events](https://hudi.apache.org/community/syncs). - - ### 37 Interactive Entertainment [37 Interactive Entertainment](https://www.37wan.net/) is a global Top20 listed game company, and a leading company on A-shares market of China. Apache Hudi is integrated into our Data Middle Platform offering real-time data warehouse and solving the problem of frequent changes of data.
Re: [PR] [DOCS][MINOR] Update powered-by page [hudi]
xushiyan merged PR #10853: URL: https://github.com/apache/hudi/pull/10853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]
vinothchandar commented on code in PR #10845: URL: https://github.com/apache/hudi/pull/10845#discussion_r1521484287 ## hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.DummyActiveAction; +import org.apache.hudi.client.timeline.LSMTimelineWriter; +import org.apache.hudi.common.engine.HoodieEngineContext; +import org.apache.hudi.common.engine.HoodieLocalEngineContext; +import org.apache.hudi.common.engine.LocalTaskContextSupplier; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.WriteOperationType; +import org.apache.hudi.common.testutils.HoodieCommonTestHarness; +import org.apache.hudi.common.testutils.HoodieTestTable; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.config.HoodieIndexConfig; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.index.HoodieIndex; + +import org.apache.hadoop.conf.Configuration; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Test cases for {@link HoodieGlobalTimeline}. + */ +public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness { + @BeforeEach + public void setUp() throws Exception { +initMetaClient(); + } + + @AfterEach + public void tearDown() throws Exception { +cleanMetaClient(); + } + + /** + * The test for checking whether an instant is archived. 
+ */ + @Test + void testArchivingCheck() throws Exception { +writeArchivedTimeline(10, 1000, 50); +writeActiveTimeline(1050, 10); +HoodieGlobalTimeline globalTimeline = new HoodieGlobalTimeline(this.metaClient, Option.empty()); +assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant should be active"); Review Comment: I see what you are alluding to. For now, let's rename these `isBeforeActiveTimelineStarts` to make it explicit. We can then clean up these methods. ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java: ## @@ -167,12 +167,13 @@ public HoodieDefaultTimeline getWriteTimeline() { @Override public HoodieTimeline getContiguousCompletedWriteTimeline() { -Option earliestPending = getWriteTimeline().filterInflightsAndRequested().firstInstant(); +HoodieDefaultTimeline writeTimeline = getWriteTimeline(); Review Comment: is this now changing this method to look for contiguous timeline on both active + archived? whats the effect of this change ## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java: ## @@ -0,0 +1,203 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.util.Opt
Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]
bksrepo commented on issue #10609: URL: https://github.com/apache/hudi/issues/10609#issuecomment-1991619367

I am using Spark 3.4.1 with the Hudi bundle 'hudi-spark3.4-bundle_2.12-0.14.0.jar'; Hadoop is 3.3.6 and the source database is MySQL 8.0.36. The reported ERROR comes at the time of saving the dataframe; up to df.show() the code works fine.

```
from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType, DateType, TimestampType, BooleanType

# SparkSession
spark = SparkSession.builder \
    .appName('Sample_COW') \
    .config("spark.yarn.jars", "/opt/spark-3.4.1-bin-hadoop3/jars/*.jar") \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.hudi.catalog.HoodieCatalog') \
    .config('spark.kryo.registrator', 'org.apache.spark.HoodieSparkKryoRegistrar') \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .config('spark.sql.warehouse.dir', 'hdfs://nn:8020/mnt/hive/warehouse') \
    .config('spark.sql.debug.maxToStringFields', '200') \
    .config('spark.hadoop.fs.defaultFS', 'hdfs://Name-Node-Server:8020') \
    .config('spark.executor.extraClassPath', '/opt/spark-3.4.1-bin-hadoop3/jars/jackson-databind-2.14.2.jar') \
    .config('spark.driver.extraClassPath', '/opt/spark-3.4.1-bin-hadoop3/jars/jackson-databind-2.14.2.jar') \
    .config('spark.hadoop.yarn.resourcemanager.hostname', 'Name-Node-Server') \
    .config("spark.sql.hive.convertMetastoreParquet", "true") \
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
    .config("spark.hadoop.fs.replication", "1") \
    .enableHiveSupport() \
    .getOrCreate()

# Define MySQL connection properties, with the whole table as a dump or selective columns with a where clause.
mysql_props = {
    "url": "jdbc:mysql://localhost:3306/",
    "driver": "com.mysql.cj.jdbc.Driver",
    "user": "Xx",
    "password": "X",
    "dbtable": "(select id, pid, center_id, center_code, visit_type, create_price_list_id, gender, age, age_frequency, clinical_detail, clinical_history_file, sample_drawn_date, sample_drawn_time_hrs, sample_drawn_time_min, referal_doctor_id, referal_doctor, referal_customer_id, referal_customer, department_id, profile_ids, test_ids, amount, discount, total_amount, mrp, payment_mode, amount_paid, amount_balance, test_status_code, UNIX_TIMESTAMP(log_date_created) AS log_date_created, created_by, deleted, sample_status, other_comments, team_lead_id, tech_lead_id, pathologist_id, tele_pathologist_id, Graph_path, UNIX_TIMESTAMP(CONVERT_TZ(authentication_date,'+05:30','+00:00')) AS authentication_date, reference_patient_id, protocol_id, visit_info, ref_center, investigator_details, month_year, UNIX_TIMESTAMP(CONVERT_TZ(sample_collection_datetime_at_source,'+05:30', '+00:00')) AS sample_collection_timestamp FROM sample) as sample"
}

# Read data from MySQL
df = spark.read.format("jdbc").options(**mysql_props).load()

# Define the Hudi table schema to avoid any auto FieldType conversion and casting issues.
hoodie_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("pid", StringType(), True),
    StructField("center_id", IntegerType(), True),
    StructField("center_code", StringType(), True),
    StructField("visit_type", StringType(), True),
    StructField("create_price_list_id", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("age_frequency", StringType(), True),
    StructField("clinical_detail", StringType(), True),
    StructField("clinical_history_file", StringType(), True),
    StructField("sample_drawn_date", DateType(), True),
    StructField("sample_drawn_time_hrs", StringType(), True),
    StructField("sample_drawn_time_min", StringType(), True),
    StructField("referal_doctor_id", StringType(), True),
    StructField("referal_doctor", StringType(), True),
    StructField("referal_customer_id", StringType(), True),
```
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991552838 ## CI report: * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
geserdugarov commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991532665

Need to figure out the reason for the failures in `TestCDCDataFrameSuite.testCOWDataSourceWrite`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825613#comment-17825613 ] Geser Dugarov edited comment on HUDI-7493 at 3/12/24 12:17 PM: --- Could be labeled by the "Config Simplification" epic. was (Author: JIRAUSER301110): Could be label by the ["Config Simplification" epic|https://issues.apache.org/jira/browse/HUDI-5738]. > Clean configuration for clean service > - > > Key: HUDI-7493 > URL: https://issues.apache.org/jira/browse/HUDI-7493 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > > Sometimes we use `{{{}hoodie.clean.*`{}}} and sometimes > `{{{}hoodie.cleaner.*`.{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [DOCS][MINOR] Update powered-by page [hudi]
bhasudha opened a new pull request, #10853: URL: https://github.com/apache/hudi/pull/10853 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none"._ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]
NishantBaheti commented on issue #10850: URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991494289

![image](https://github.com/apache/hudi/assets/31793052/89d5982e-f029-4bca-8438-0b623a99d6b8)

It doesn't work; another issue.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991474440 ## CI report: * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]
hudi-bot commented on PR #10851: URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991462452 ## CI report: * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]
ad1happy2go commented on issue #10850: URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991461575

@NishantBaheti I checked before; the incremental query works fine with 0.14.1. Can you paste the full reproducible script, or the table/writer properties you used to populate the table? Which writer did you use to populate this table? I used the code below to reproduce quickly: https://gist.github.com/ad1happy2go/e7a2f8c695fde4c3db060a7113610931

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]
jayesh2424 commented on issue #10852: URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991447595

@ad1happy2go Okay, maybe the question is not clear. What you have suggested is a full load of the entire datalake into a dataframe, so that df.createOrReplaceTempView("temp_table") works for further filtering; that is, load the datalake and then run a filter job. What I am saying is: filter the datalake and load only that part. For example, this is how we load the datalake:

```
datalake_full_load = self.spark.read.format('org.apache.hudi').load(target_path)
```

Here we can use .select() to pull only particular columns, .filter(), etc. What I want is something like:

```
datalake_full_load = self.spark.read.format('org.apache.hudi').load(target_path).filter("select date(created) as created, count(*) as datalake_count from datalake group by date(created)")
```

My SQL might look weird; don't take it word for word. I may want to achieve a similar result, but that is not the exact point; I am just describing what window of data I want from the datalake. Also, regarding the partitioning you mentioned: my datalake is partitioned, but not by dates, and I want data based on dates. Here `created` is a timestamp value and hence is not used as a partition in my datalake.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]
ad1happy2go commented on issue #10852: URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991416169

@jayesh2424 Sorry, but I am not entirely clear on the question. If you are asking how to read a specific part of a table, you can read a data frame and apply where/filter on it. If your dataset is partitioned and you apply a filter on the partition column, then Spark will only read the matching subdirectories (in case that is what you meant by "part of the data lake"). If you want to use SQL, you can do:

```
df.createOrReplaceTempView("temp_table")
spark.sql("select * from temp_table where a = 1")
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
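To make the filter-before-load idea from this thread concrete: because Spark evaluates lazily, a filter chained onto the Hudi read is applied (and pushed down where possible) before any data is materialized, so you do not pay for a "full load" first. The sketch below is an assumption-laden illustration, not code from the thread: it presumes a live `spark` session with the Hudi bundle on the classpath, and reuses the hypothetical `target_path` and `created` timestamp column from the discussion above.

```python
# Sketch only: assumes a running Spark session with the Hudi bundle available;
# target_path and the 'created' column are hypothetical names from the thread.
from pyspark.sql import functions as F

daily_counts = (
    spark.read.format("org.apache.hudi")
    .load(target_path)                             # lazy: nothing is scanned yet
    .filter(F.col("created") >= "2024-03-01")      # pushed down to the file scan where possible
    .groupBy(F.to_date("created").alias("created"))
    .agg(F.count("*").alias("datalake_count"))
)
daily_counts.show()  # only this action triggers the (pruned) read
```

Note that pruning whole subdirectories still requires the filter column to be a partition column; a filter on a non-partition timestamp like `created` avoids materializing the full table in the driver, but Spark may still scan all partitions.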
Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]
jayesh2424 commented on issue #10852: URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991371544 @xushiyan, @ad1happy2go and @codope could you please help me out with this ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]
jayesh2424 opened a new issue, #10852: URL: https://github.com/apache/hudi/issues/10852

I have a Hudi datalake in AWS. Currently, for ETL operations I usually do a full load of the Hudi datalake. I want to know how I can get only a particular set of data from the Hudi datalake. What I really want to achieve is a method like create_dynamic_frame.from_options(), where we send a sample query to the database and fetch only a particular set of data. Just like that, I want to send a SQL query to the Hudi datalake. The main goal: rather than loading the datalake and then filtering, I want to filter the datalake and then load only that particular part of it. It would be great if I could load this particular data with the help of Spark SQL.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-7493] Consistent naming of Clean configuration parameters [hudi]
geserdugarov opened a new pull request, #10851: URL: https://github.com/apache/hudi/pull/10851

### Change Logs

`ConfigProperty.key()` and `ConfigOption.key()` are used for docs generation, and we need to move towards consistent naming of all parameters in Hudi. This MR proposes the first step: changing the names of the Cleaner configuration parameters only. The following table shows the proposed changes.

| New main key | Previous key in `HoodieCleanConfig` | Previous key in `FlinkOptions` |
| - | - | - |
| hoodie.cleaner.automatic | hoodie.clean.automatic | |
| hoodie.cleaner.async.enabled | hoodie.clean.async | clean.async.enabled |
| hoodie.cleaner.multiple.enabled | hoodie.clean.allow.multiple | |
| hoodie.cleaner.incremental.enabled | hoodie.cleaner.incremental.mode | |
| hoodie.cleaner.parallelism | hoodie.cleaner.parallelism | |
| hoodie.cleaner.policy | hoodie.cleaner.policy | clean.policy |
| hoodie.cleaner.commits.retained | hoodie.cleaner.commits.retained | clean.retain_commits |
| hoodie.cleaner.fileversions.retained | hoodie.cleaner.fileversions.retained | clean.retain_file_versions |
| hoodie.cleaner.hours.retained | hoodie.cleaner.hours.retained | clean.retain_hours |
| hoodie.cleaner.trigger.strategy | hoodie.clean.trigger.strategy | |
| hoodie.cleaner.trigger.max.commits | hoodie.clean.max.commits | |
| hoodie.cleaner.delete.bootstrap.base.file | hoodie.cleaner.delete.bootstrap.base.file | |
| hoodie.cleaner.failed.writes.policy | hoodie.cleaner.policy.failed.writes | |

Currently, we have about 900+ parameters in Hudi, as I mentioned in [HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738). Note that my table contains only 899 of them and misses, for instance, the recently added TTL parameters.

### Impact

To preserve backward compatibility, the old keys remain available, which is done by using `ConfigProperty.withAlternatives("...")` for `HoodieCleanConfig` and `ConfigOption.withFallbackKeys("...")` for `FlinkOptions`.
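The backward-compatibility mechanism described above (a new primary key plus alternative/fallback legacy keys) can be illustrated with a small toy model. This is not Hudi's actual implementation, just a language-agnostic sketch of how `withAlternatives`/`withFallbackKeys` resolution behaves: the new key wins if present, otherwise the old key is consulted, otherwise the default applies.

```python
# Toy model (not Hudi code): a config property with a primary key and
# alternative (legacy) keys; lookup prefers the new key, then falls back.
class ConfigProperty:
    def __init__(self, key, default=None, alternatives=()):
        self.key = key
        self.default = default
        self.alternatives = tuple(alternatives)

    def resolve(self, props):
        """Return the value for this property from a dict of raw settings."""
        for k in (self.key, *self.alternatives):
            if k in props:
                return props[k]
        return self.default

# One renamed parameter from the table above (default value is assumed here):
CLEANER_ASYNC_ENABLED = ConfigProperty(
    "hoodie.cleaner.async.enabled", default="false",
    alternatives=("hoodie.clean.async",),
)

# A job still using the old key keeps working:
old_style = {"hoodie.clean.async": "true"}
print(CLEANER_ASYNC_ENABLED.resolve(old_style))  # -> true
```

If both keys are present, the new primary key takes precedence, which is the usual contract for such alias mechanisms.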
### Risk level (write none, low medium or high below) Low ### Documentation Update I will prepare corresponding MR with updates to the documentation. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7493: - Labels: pull-request-available (was: ) > Clean configuration for clean service > - > > Key: HUDI-7493 > URL: https://issues.apache.org/jira/browse/HUDI-7493 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > > Sometimes we use `{{{}hoodie.clean.*`{}}} and sometimes > `{{{}hoodie.cleaner.*`.{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7493) Clean configuration for clean service
[ https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825613#comment-17825613 ] Geser Dugarov commented on HUDI-7493: - Could be label by the ["Config Simplification" epic|https://issues.apache.org/jira/browse/HUDI-5738]. > Clean configuration for clean service > - > > Key: HUDI-7493 > URL: https://issues.apache.org/jira/browse/HUDI-7493 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > > Sometimes we use `{{{}hoodie.clean.*`{}}} and sometimes > `{{{}hoodie.cleaner.*`.{}}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]
ranwani commented on PR #9717: URL: https://github.com/apache/hudi/pull/9717#issuecomment-1991259976

@yihua: We need to use Hudi with Spark 3.5. Can you let me know when the Hudi 0.15.0 release is planned?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]
NishantBaheti commented on issue #10850: URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991167509 Hello, I am using this jar - hudi-spark3.3-bundle_2.12-0.14.1.jar - spark 3.3 - hudi 0.14.1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]
danny0405 commented on issue #10850: URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991109123 Hi, @NishantBaheti , thanks for your feedback, could you also supplement the release version for Spark and Hudi respectively. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]
danny0405 commented on code in PR #10727: URL: https://github.com/apache/hudi/pull/10727#discussion_r1521102072 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java: ## @@ -202,7 +202,9 @@ private Option> composeSchemaEvolutionTrans Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName()); Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName()); boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size() - || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE; + && SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() + == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE; + Review Comment: So when the column size equals, there is no need to rewrite no matter whether the schema is compatible? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]
xiarixiaoyao commented on code in PR #10727: URL: https://github.com/apache/hudi/pull/10727#discussion_r1521039170 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java: ## @@ -202,7 +202,9 @@ private Option> composeSchemaEvolutionTrans Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName()); Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName()); boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size() - || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE; + || !(SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() Review Comment: The original logic has certain performance issues If the read-write schema is compatible, i think we no need rewrite the entire record. since we can read from old parquet file by new schema correctly. @danny0405 @ThinkerLei -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
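As context for the review exchange above, the debated condition reduces to a two-variable predicate. The toy model below (not Hudi code; the PR itself weighs two different fixes) only illustrates the performance concern raised by the reviewer about the original `||` form:

```python
# Toy truth table (not Hudi code) for the predicate discussed in this review.
# Original:  needToReWriteRecord = colsDiffer || schemasCompatible
# Concern: when the column sets match AND the read/write schemas are
# compatible, the original predicate still forces a full record rewrite,
# even though the old parquet file could be read with the new schema.
def need_rewrite_original(cols_differ: bool, schemas_compatible: bool) -> bool:
    return cols_differ or schemas_compatible

# The wasteful case flagged by the reviewer: nothing changed structurally,
# schemas are compatible, yet every record would be rewritten.
print(need_rewrite_original(False, True))  # -> True
```

Which replacement predicate is correct (the `&&` form in one diff hunk versus the `|| !compatible` form in the other) is exactly what the review comments are working out.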
[I] [SUPPORT] Incremental query not working on COW table [hudi]
NishantBaheti opened a new issue, #10850: URL: https://github.com/apache/hudi/issues/10850

## Error

Error Category: QUERY_ERROR; AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_commit_seqno`, `_hoodie_commit_time`, `_hoodie_file_name`, `_hoodie_partition_path`, `_hoodie_record_key`

```
hudi_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': start_time,
    'hoodie.datasource.read.end.instanttime': end_time,
}

df = spark.read \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .load(tablePath)
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org