Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993682626

   
   ## CI report:
   
   * 7bce9399d616a570e8a04c783b06e7e2f404dc5a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22886)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7492] fix the issue of incorrect keygenerator specification when creating m… [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10840:
URL: https://github.com/apache/hudi/pull/10840#issuecomment-1993682540

   
   ## CI report:
   
   * cf41aa0ce79b39dc6f09db500db4b123fed34ff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22877)
 
   * c6d233de457320d91579376f3d4669ee4dcf8f50 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22887)
 
   
   
   





Re: [PR] [HUDI-7492] fix the issue of incorrect keygenerator specification when creating m… [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10840:
URL: https://github.com/apache/hudi/pull/10840#issuecomment-1993675304

   
   ## CI report:
   
   * cf41aa0ce79b39dc6f09db500db4b123fed34ff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22877)
 
   * c6d233de457320d91579376f3d4669ee4dcf8f50 UNKNOWN
   
   
   





Re: [I] [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 [hudi]

2024-03-12 Thread via GitHub


codope closed issue #9119: [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error 
upserting bucketType UPDATE for partition :13
URL: https://github.com/apache/hudi/issues/9119





Re: [I] [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 [hudi]

2024-03-12 Thread via GitHub


ad1happy2go commented on issue #9119:
URL: https://github.com/apache/hudi/issues/9119#issuecomment-1993660645

Closing this, as it was fixed via https://github.com/apache/hudi/pull/9984





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993620652

   
   ## CI report:
   
   * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
 
   * 7bce9399d616a570e8a04c783b06e7e2f404dc5a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22886)
 
   
   
   





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993614748

   
   ## CI report:
   
   * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
 
   * 7bce9399d616a570e8a04c783b06e7e2f404dc5a UNKNOWN
   
   
   





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


geserdugarov commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1993528107

   > Old key name `hoodie.clean.automatic` in `TestCDCDataFrameSuite.testCOWDataSourceWrite` doesn't work. Search for the reason.
   
   The reason is that in `HoodieCDCTestBase`, `HoodieCleanConfig.AUTO_CLEAN.key -> "false"` is added to `commonOpts`.
   If we then add `.option("hoodie.clean.automatic", "true")` in `TestCDCDataFrameSuite.testCOWDataSourceWrite` on top of the same `commonOpts`, we end up with two different values for a single parameter.
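
   To make the clash concrete, here is a minimal self-contained sketch (the key `hoodie.cleaner.auto.enabled` is a hypothetical renamed alias invented for illustration; the real constants live in `HoodieCleanConfig`): when both the old and the new alias end up in one options map, the writer effectively receives two values for a single logical parameter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConfigClashSketch {
  // The old key is taken from the comment above; the "new" key is a made-up alias.
  static final String OLD_KEY = "hoodie.clean.automatic";
  static final String NEW_KEY = "hoodie.cleaner.auto.enabled"; // hypothetical, for illustration only

  // Collects every value supplied for one logical parameter across its aliases.
  static List<String> valuesFor(Map<String, String> opts, String... aliases) {
    List<String> values = new ArrayList<>();
    for (String alias : aliases) {
      if (opts.containsKey(alias)) {
        values.add(opts.get(alias));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    Map<String, String> commonOpts = new HashMap<>();
    commonOpts.put(NEW_KEY, "false"); // what the test base class puts into commonOpts
    commonOpts.put(OLD_KEY, "true");  // what the test adds via .option(...)

    // One logical parameter, two conflicting values: the situation described above.
    System.out.println(valuesFor(commonOpts, OLD_KEY, NEW_KEY));
  }
}
```

   Which value wins then depends on the order in which the writer resolves the aliases, so the test no longer exercises the intended configuration.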
   





Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993363802

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * 948d9ecb6dc661628b787ba800756b78d52791af Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
 
   
   
   





Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993226243

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * 998a987f33866407fd1b2d8350e6c2f2386f59ad Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22882)
 
   * 948d9ecb6dc661628b787ba800756b78d52791af Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22885)
 
   
   
   





Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10845:
URL: https://github.com/apache/hudi/pull/10845#issuecomment-1993208787

   
   ## CI report:
   
   * c50e42d4b21dc1af358b61b0d814cfb50248bfe0 UNKNOWN
   * 998a987f33866407fd1b2d8350e6c2f2386f59ad Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22882)
 
   * 948d9ecb6dc661628b787ba800756b78d52791af UNKNOWN
   
   
   





Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r152204


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * A global timeline view with both active and archived timeline involved.
+ */
+public class HoodieGlobalTimeline extends HoodieDefaultTimeline {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class);
+  private final HoodieTableMetaClient metaClient;
+  private final HoodieActiveTimeline activeTimeline;
+  private final HoodieArchivedTimeline archivedTimeline;
+
+  protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option<String> startInstant) {
+    this.metaClient = metaClient;
+    this.activeTimeline = new HoodieActiveTimeline(metaClient);
+    archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient);
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));

Review Comment:
   I made the archived timeline loading smart based on the given `startTs`. If we also want lazy loading at the level of specific API invocations, that would need a refactoring of all the timelines to support lazy setup of the instants.
   
   I would prefer to do that in another PR if we reach consensus.






Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1522287754


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * A global timeline view with both active and archived timeline involved.
+ */
+public class HoodieGlobalTimeline extends HoodieDefaultTimeline {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class);
+  private final HoodieTableMetaClient metaClient;
+  private final HoodieActiveTimeline activeTimeline;
+  private final HoodieArchivedTimeline archivedTimeline;
+
+  protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option<String> startInstant) {
+    this.metaClient = metaClient;
+    this.activeTimeline = new HoodieActiveTimeline(metaClient);
+    archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient);
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));
+  }
+
+  protected HoodieGlobalTimeline(HoodieActiveTimeline activeTimeline, HoodieArchivedTimeline archivedTimeline) {
+    this.metaClient = activeTimeline.metaClient;
+    this.activeTimeline = activeTimeline;
+    this.archivedTimeline = archivedTimeline;
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));
+  }
+
+  /**
+   * For serialization and de-serialization only.
+   */
+  public HoodieGlobalTimeline() {
+    this.activeTimeline = null;
+    this.archivedTimeline = null;
+    this.metaClient = null;
+  }
+
+  @Override
+  public HoodieTimeline filterPendingCompactionTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingCompactionTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterPendingLogCompactionTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingLogCompactionTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterPendingMajorOrMinorCompactionTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingMajorOrMinorCompactionTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterPendingReplaceTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingReplaceTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterPendingRollbackTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingRollbackTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterRequestedRollbackTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterRequestedRollbackTimeline();
+  }
+
+  @Override
+  public HoodieTimeline filterPendingIndexTimeline() {
+    // override for efficiency
+    return this.activeTimeline.filterPendingIndexTimeline();
+  }
+
+  @Override
+  public boolean empty() {
+    return this.activeTimeline.empty();
+  }
+
+  /**
+   * Returns whether the active timeline contains the given instant or the instant is archived.
+   * Needs to rethink the new semantics and rename this method with global timeline introduced.
+   */
+  @Override
+  public boolean containsOrBeforeTimelineStarts(String ts) {
+    return this.activeTimeline.containsOrBeforeTimelineStarts(ts);
+  }
+
+  @Override
+  public boolean isBeforeTimelineStarts(String ts) {
+    return this.activeTimeline.isBeforeTimelineStarts(ts);
+  }
+
+  @Override
+  public Option getF

Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1522286800


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.util.List;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * A global timeline view with both active and archived timeline involved.
+ */
+public class HoodieGlobalTimeline extends HoodieDefaultTimeline {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieGlobalTimeline.class);
+  private final HoodieTableMetaClient metaClient;
+  private final HoodieActiveTimeline activeTimeline;
+  private final HoodieArchivedTimeline archivedTimeline;
+
+  protected HoodieGlobalTimeline(HoodieTableMetaClient metaClient, Option<String> startInstant) {
+    this.metaClient = metaClient;
+    this.activeTimeline = new HoodieActiveTimeline(metaClient);
+    archivedTimeline = startInstant.isPresent() ? new HoodieArchivedTimeline(metaClient, startInstant.get()) : new HoodieArchivedTimeline(metaClient);
+    this.details = FederatedDetails.create(this.activeTimeline, archivedTimeline);
+    setInstants(mergeInstants(archivedTimeline.getInstants(), activeTimeline.getInstants()));

Review Comment:
   Yeah, that could be smarter. In this PR, when `startTs` is empty, the archived timeline loading actually does nothing, but I think we can even make the loading lazy when the `startTs` instant itself is active.
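
   The lazy setup being discussed can be sketched with a small memoizing wrapper (a generic illustration, not the actual Hudi API; `LazySketch` and the names in it are invented): the expensive archived-timeline load runs only the first time the archived part is actually requested.

```java
import java.util.function.Supplier;

/**
 * Minimal memoized-lazy-loading sketch: the supplier (e.g. an archived
 * timeline scan) is invoked at most once, on first access.
 */
public class LazySketch<T> {
  private final Supplier<T> loader;
  private T value;
  private boolean loaded;

  public LazySketch(Supplier<T> loader) {
    this.loader = loader;
  }

  public synchronized T get() {
    if (!loaded) {
      value = loader.get(); // the costly load happens here, exactly once
      loaded = true;
    }
    return value;
  }

  public static void main(String[] args) {
    LazySketch<String> archived = new LazySketch<>(() -> {
      System.out.println("loading archived timeline...");
      return "archived-instants";
    });
    // No load has happened yet; only the first get() triggers it.
    System.out.println(archived.get());
    System.out.println(archived.get()); // served from the memoized value
  }
}
```

   A global timeline could hold such a wrapper around its archived part and only call `get()` for instants older than the active timeline start, matching the "load only when the `startTs` instant is not active" idea above.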






Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1522285239


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -167,12 +167,13 @@ public HoodieDefaultTimeline getWriteTimeline() {
 
   @Override
   public HoodieTimeline getContiguousCompletedWriteTimeline() {
-    Option<HoodieInstant> earliestPending = getWriteTimeline().filterInflightsAndRequested().firstInstant();
+    HoodieDefaultTimeline writeTimeline = getWriteTimeline();

Review Comment:
   Just a small optimization by hand, not relevant to this PR.






Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1522284552


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.DummyActiveAction;
+import org.apache.hudi.client.timeline.LSMTimelineWriter;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.engine.LocalTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.testutils.HoodieTestTable;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test cases for {@link HoodieGlobalTimeline}.
+ */
+public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness {
+  @BeforeEach
+  public void setUp() throws Exception {
+    initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+    cleanMetaClient();
+  }
+
+  /**
+   * The test for checking whether an instant is archived.
+   */
+  @Test
+  void testArchivingCheck() throws Exception {
+    writeArchivedTimeline(10, 1000, 50);
+    writeActiveTimeline(1050, 10);
+    HoodieGlobalTimeline globalTimeline = new HoodieGlobalTimeline(this.metaClient, Option.empty());
+    assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant should be active");

Review Comment:
   Agree, kind of think `isArchived` is the proper naming, WDYT?






Re: [PR] DOCS-updated videos [hudi]

2024-03-12 Thread via GitHub


nfarah86 commented on PR #10855:
URL: https://github.com/apache/hudi/pull/10855#issuecomment-1992652425

   https://github.com/apache/hudi/assets/5392555/264db195-dfe0-4712-b12f-d60122feb8dc
   
   @bhasudha @xushiyan  pr for videos





[PR] updated videos [hudi]

2024-03-12 Thread via GitHub


nfarah86 opened a new pull request, #10855:
URL: https://github.com/apache/hudi/pull/10855

   ### Change Logs
   
   updated videos
   
   ### Impact
   
   none
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]

2024-03-12 Thread via GitHub


Tyler-Rendina commented on issue #10590:
URL: https://github.com/apache/hudi/issues/10590#issuecomment-1992510921

   While I can kick off backfills, they eventually fail alongside streams with `java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;`
 





Re: [PR] [DO NOT MERGE][DOCS] Add more users in the powered-by page [hudi]

2024-03-12 Thread via GitHub


bhasudha commented on PR #10854:
URL: https://github.com/apache/hudi/pull/10854#issuecomment-1992331710

   Tested locally. Screenshots here!
   ![Screenshot 2024-03-12 at 11 49 27 AM](https://github.com/apache/hudi/assets/2179254/d0d65d05-d081-44dc-854f-cafed6126cfe)
   ![Screenshot 2024-03-12 at 11 49 42 AM](https://github.com/apache/hudi/assets/2179254/de49d177-33d6-497f-9c02-425db1e274c3)
   





[PR] [DO NOT MERGE][DOCS] Add more users in the powered-by page [hudi]

2024-03-12 Thread via GitHub


bhasudha opened a new pull request, #10854:
URL: https://github.com/apache/hudi/pull/10854

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/module/scala/DefaultScalaModule$ when doing an Incremental CDC Query in 0.14.1 [hudi]

2024-03-12 Thread via GitHub


Tyler-Rendina commented on issue #10590:
URL: https://github.com/apache/hudi/issues/10590#issuecomment-1992114835

   A final note, and apologies for the number of posts, but this may help EMR users with Glue as their Hive service.
   
   Make sure to build Hudi using Java 8. If you are on ARM, use something like Azul OpenJDK and export $JAVA_HOME as the provided path, e.g., /Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home.
   
   Once you upload your jars to S3, bootstrap with something like:
   ```
   sudo chown -R $USER:root /usr/lib/hudi
   sudo chmod -R ugo+rw /usr/lib/hudi
   aws s3 cp s3://BUCKET/jars/hudi-aws-bundle-0.14.1.jar /usr/lib/hudi
   aws s3 cp s3://BUCKET/jars/hudi-spark3.3-bundle_2.12-0.14.1.jar /usr/lib/hudi
   sudo ln -sf /usr/lib/hudi/hudi-aws-bundle-0.14.1.jar /usr/lib/hudi/hudi-aws-bundle.jar
   sudo ln -sf /usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar /usr/lib/hudi/hudi-spark3.3-bundle.jar
   ```
   
   To use your custom-built Hudi package, reference the bootstrap paths in the 
following spark-submit command elements:
   ```
   "--jars",
   
"/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar,/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar,",
   "--conf",
   
"spark.driver.extraClassPath=/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar:/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar",
   "--conf",
   
"spark.executor.extraClassPath=/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar:/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar",
   ```
   
   Finally (this was the cause of my Class Not Found error for 
com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory), **DO 
NOT** use .enableHiveSupport() when setting up your Spark context. While this 
works when Hudi is imported with --packages, it will try to use the wrong Hive 
package when you specify --jars.
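   The separator rules above are easy to get wrong (commas for `--jars`, colons for `extraClassPath`), so as a rough sketch the argument elements can be assembled programmatically. The helper name and jar paths below are illustrative assumptions, not part of any Hudi or Spark API:

   ```python
   # Jar paths from the bootstrap step; adjust to your own layout.
   HUDI_JARS = [
       "/usr/lib/hudi/hudi-aws-bundle-0.14.1.jar",
       "/usr/lib/hudi/hudi-spark3.3-bundle_2.12-0.14.1.jar",
   ]

   def hudi_submit_args(jars):
       # --jars takes a comma-separated list; extraClassPath takes a
       # colon-separated classpath on Linux.
       classpath = ":".join(jars)
       return [
           "--jars", ",".join(jars),
           "--conf", "spark.driver.extraClassPath=" + classpath,
           "--conf", "spark.executor.extraClassPath=" + classpath,
       ]

   args = hudi_submit_args(HUDI_JARS)
   ```

   Passing the same list to both keeps the `--jars` entries and the driver/executor classpaths consistent.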





(hudi) branch asf-site updated: [DOCS][MINOR] Update powered-by page (#10853)

2024-03-12 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 03ea1e1b180 [DOCS][MINOR] Update powered-by page (#10853)
03ea1e1b180 is described below

commit 03ea1e1b18086bc7f2e877a188d8a33db965cc40
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Tue Mar 12 07:25:20 2024 -0700

[DOCS][MINOR] Update powered-by page (#10853)
---
 website/src/pages/powered-by.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/website/src/pages/powered-by.md b/website/src/pages/powered-by.md
index 6b5f57cc486..f26fe68741f 100644
--- a/website/src/pages/powered-by.md
+++ b/website/src/pages/powered-by.md
@@ -11,8 +11,6 @@ of people and organizations from all around the globe. The 
following is a small
 contributed to the Apache Hudi community! [Join us on 
slack](https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg),
 
 or come to one of our [virtual community 
events](https://hudi.apache.org/community/syncs).
 
-
-
 ### 37 Interactive Entertainment
 [37 Interactive Entertainment](https://www.37wan.net/) is a global Top20 
listed game company, and a leading company on A-shares market of China.
 Apache Hudi is integrated into our Data Middle Platform offering real-time 
data warehouse and solving the problem of frequent changes of data.



Re: [PR] [DOCS][MINOR] Update powered-by page [hudi]

2024-03-12 Thread via GitHub


xushiyan merged PR #10853:
URL: https://github.com/apache/hudi/pull/10853





Re: [PR] [HUDI-7497] Add a global timeline mingled with active and archived instants [hudi]

2024-03-12 Thread via GitHub


vinothchandar commented on code in PR #10845:
URL: https://github.com/apache/hudi/pull/10845#discussion_r1521484287


##
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieGlobalTimeline.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.DummyActiveAction;
+import org.apache.hudi.client.timeline.LSMTimelineWriter;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.engine.HoodieLocalEngineContext;
+import org.apache.hudi.common.engine.LocalTaskContextSupplier;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
+import org.apache.hudi.common.testutils.HoodieTestTable;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.apache.hadoop.conf.Configuration;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test cases for {@link HoodieGlobalTimeline}.
+ */
+public class TestHoodieGlobalTimeline extends HoodieCommonTestHarness {
+  @BeforeEach
+  public void setUp() throws Exception {
+initMetaClient();
+  }
+
+  @AfterEach
+  public void tearDown() throws Exception {
+cleanMetaClient();
+  }
+
+  /**
+   * The test for checking whether an instant is archived.
+   */
+  @Test
+  void testArchivingCheck() throws Exception {
+writeArchivedTimeline(10, 1000, 50);
+writeActiveTimeline(1050, 10);
+HoodieGlobalTimeline globalTimeline = new 
HoodieGlobalTimeline(this.metaClient, Option.empty());
+assertTrue(globalTimeline.isBeforeTimelineStarts("1049"), "The instant 
should be active");

Review Comment:
   I see what you are alluding to. For now, let's rename these to 
`isBeforeActiveTimelineStarts` to make it explicit. We can then clean up these 
methods. 
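   The archiving check under discussion boils down to comparing an instant timestamp against the first instant of the active timeline. A rough standalone sketch of that boundary check (the function name and timestamps are illustrative, not Hudi's actual implementation):

   ```python
   def is_before_active_timeline_starts(instant, first_active_instant):
       # Hudi instant times are fixed-width digit strings, so lexicographic
       # comparison matches numeric ordering; anything smaller than the first
       # active instant must live in the archived timeline.
       return instant < first_active_instant

   # Mirroring the test scenario: archived instants end at 1049,
   # the active timeline starts at 1050.
   in_archived = is_before_active_timeline_starts("1049", "1050")
   in_active = is_before_active_timeline_starts("1050", "1050")
   ```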



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java:
##
@@ -167,12 +167,13 @@ public HoodieDefaultTimeline getWriteTimeline() {
 
   @Override
   public HoodieTimeline getContiguousCompletedWriteTimeline() {
-    Option<HoodieInstant> earliestPending = getWriteTimeline().filterInflightsAndRequested().firstInstant();
+    HoodieDefaultTimeline writeTimeline = getWriteTimeline();

Review Comment:
   Is this now changing this method to look for a contiguous timeline across 
both active + archived? What is the effect of this change?



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieGlobalTimeline.java:
##
@@ -0,0 +1,203 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.Opt

Re: [I] RLI Spark Hudi Error occurs when executing map [hudi]

2024-03-12 Thread via GitHub


bksrepo commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1991619367

   I am using Spark 3.4.1 with the Hudi bundle 
'hudi-spark3.4-bundle_2.12-0.14.0.jar'; Hadoop is 3.3.6 and the source database 
is MySQL 8.0.36.
   
   The reported ERROR occurs when saving the DataFrame; the code up to 
df.show() works fine.
   
   from pyspark.sql import SparkSession, functions
   from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType, DateType, TimestampType, BooleanType
   
   
   # SparkSession
   
   
   spark = SparkSession.builder \
       .appName('Sample_COW') \
       .config("spark.yarn.jars", "/opt/spark-3.4.1-bin-hadoop3/jars/*.jar") \
       .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
       .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.hudi.catalog.HoodieCatalog') \
       .config('spark.kryo.registrator', 'org.apache.spark.HoodieSparkKryoRegistrar') \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
       .config('spark.sql.warehouse.dir', 'hdfs://nn:8020/mnt/hive/warehouse') \
       .config('spark.sql.debug.maxToStringFields', '200') \
       .config('spark.hadoop.fs.defaultFS', 'hdfs://Name-Node-Server:8020') \
       .config('spark.executor.extraClassPath', '/opt/spark-3.4.1-bin-hadoop3/jars/jackson-databind-2.14.2.jar') \
       .config('spark.driver.extraClassPath', '/opt/spark-3.4.1-bin-hadoop3/jars/jackson-databind-2.14.2.jar') \
       .config('spark.hadoop.yarn.resourcemanager.hostname', 'Name-Node-Server') \
       .config("spark.sql.hive.convertMetastoreParquet", "true") \
       .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") \
       .config("spark.hadoop.fs.replication", "1") \
       .enableHiveSupport() \
       .getOrCreate()
   
   
   
   # Define MySQL connection properties, with the whole table as a dump or selective columns with a WHERE clause.
   
   
   mysql_props = {
       "url": "jdbc:mysql://localhost:3306/",
       "driver": "com.mysql.cj.jdbc.Driver",
       "user": "Xx",
       "password": "X",
       "dbtable": "(select id, pid, center_id, center_code, visit_type, create_price_list_id, gender, age, age_frequency, clinical_detail, clinical_history_file, sample_drawn_date, sample_drawn_time_hrs, sample_drawn_time_min, referal_doctor_id, referal_doctor, referal_customer_id, referal_customer, department_id, profile_ids, test_ids, amount, discount, total_amount, mrp, payment_mode, amount_paid, amount_balance, test_status_code, UNIX_TIMESTAMP(log_date_created) AS log_date_created, created_by, deleted, sample_status, other_comments, team_lead_id, tech_lead_id, pathologist_id, tele_pathologist_id, Graph_path, UNIX_TIMESTAMP(CONVERT_TZ(authentication_date,'+05:30','+00:00')) AS authentication_date, reference_patient_id, protocol_id, visit_info, ref_center, investigator_details, month_year, UNIX_TIMESTAMP(CONVERT_TZ(sample_collection_datetime_at_source,'+05:30', '+00:00')) AS sample_collection_timestamp FROM sample) as sample"
   }
   
   # Read data from MySQL
   df = spark.read.format("jdbc").options(**mysql_props).load()
   
   
   # Define the Hudi table schema to avoid any automatic FieldType conversion and casting issues.
   
   hoodie_schema = StructType([
       StructField("id", IntegerType(), True),
       StructField("pid", StringType(), True),
       StructField("center_id", IntegerType(), True),
       StructField("center_code", StringType(), True),
       StructField("visit_type", StringType(), True),
       StructField("create_price_list_id", IntegerType(), True),
       StructField("gender", StringType(), True),
       StructField("age", IntegerType(), True),
       StructField("age_frequency", StringType(), True),
       StructField("clinical_detail", StringType(), True),
       StructField("clinical_history_file", StringType(), True),
       StructField("sample_drawn_date", DateType(), True),
       StructField("sample_drawn_time_hrs", StringType(), True),
       StructField("sample_drawn_time_min", StringType(), True),
       StructField("referal_doctor_id", StringType(), True),
       StructField("referal_doctor", StringType(), True),
       StructField("referal_customer_id", StringType(), True),

Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991552838

   
   ## CI report:
   
   * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


geserdugarov commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991532665

   Need to figure out the reason for the failures in 
`TestCDCDataFrameSuite.testCOWDataSourceWrite`.





[jira] [Comment Edited] (HUDI-7493) Clean configuration for clean service

2024-03-12 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825613#comment-17825613
 ] 

Geser Dugarov edited comment on HUDI-7493 at 3/12/24 12:17 PM:
---

Could be labeled by the "Config Simplification" epic.


was (Author: JIRAUSER301110):
Could be label by the ["Config Simplification" 
epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> Sometimes we use `{{{}hoodie.clean.*`{}}}  and sometimes 
> `{{{}hoodie.cleaner.*`.{}}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [DOCS][MINOR] Update powered-by page [hudi]

2024-03-12 Thread via GitHub


bhasudha opened a new pull request, #10853:
URL: https://github.com/apache/hudi/pull/10853

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


NishantBaheti commented on issue #10850:
URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991494289

   
![image](https://github.com/apache/hudi/assets/31793052/89d5982e-f029-4bca-8438-0b623a99d6b8)
   
   
   It doesn't work; that is another issue.





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991474440

   
   ## CI report:
   
   * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22884)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7493] Consistent naming of Cleaner configuration parameters [hudi]

2024-03-12 Thread via GitHub


hudi-bot commented on PR #10851:
URL: https://github.com/apache/hudi/pull/10851#issuecomment-1991462452

   
   ## CI report:
   
   * 7a74bf1e9e175c7ea4ad31c99f6fc88db81b46ee UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


ad1happy2go commented on issue #10850:
URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991461575

   @NishantBaheti I checked this before; the incremental query works fine with 0.14.1. 
   
   Can you paste the full reproducible script or the table/writer properties 
you used to populate it? Which writer did you use to populate this table?
   
   I used the code below to reproduce quickly - 
   https://gist.github.com/ad1happy2go/e7a2f8c695fde4c3db060a7113610931
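   For reference, an incremental read is driven entirely by read options. A minimal sketch of building them (the helper function is illustrative; only the `hoodie.datasource.*` keys are real Hudi options, and the instant timestamps are placeholders):

   ```python
   def incremental_read_options(begin_instant, end_instant=None):
       """Build Hudi read options for an incremental query; instants are
       commit timestamps such as '20240301000000000'."""
       opts = {
           "hoodie.datasource.query.type": "incremental",
           "hoodie.datasource.read.begin.instanttime": begin_instant,
       }
       if end_instant is not None:
           opts["hoodie.datasource.read.end.instanttime"] = end_instant
       return opts

   # With Spark this would be used as:
   #   spark.read.format("hudi") \
   #        .options(**incremental_read_options("20240301000000000")) \
   #        .load(base_path)
   opts = incremental_read_options("20240301000000000")
   ```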





Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


ad1happy2go commented on issue #10850:
URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991460541

   @NishantBaheti I checked this before; the incremental query works fine with 0.14.1. 
   Can you paste the full reproducible script or the table/writer properties 
you used to populate it?
   I used the code below to reproduce quickly - 
   https://gist.github.com/ad1happy2go/e7a2f8c695fde4c3db060a7113610931





Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]

2024-03-12 Thread via GitHub


jayesh2424 commented on issue #10852:
URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991447595

   @ad1happy2go Okay, maybe the question was not clear. What you have suggested 
is to do a full load of the entire datalake into a DataFrame, so that 
df.createOrReplaceTempView("temp_table") works for further filtering. That is 
an example of loading the datalake and then running a filter job.
   
   What I am asking for is to filter the datalake and load only that part.
   For example, this is how we load the datalake: datalake_full_load = 
self.spark.read.format('org.apache.hudi').load(target_path)
   
   Here we can use .select() to pull only particular columns, .filter(), etc.
   What I want is something like:
   datalake_full_load = 
self.spark.read.format('org.apache.hudi').load(target_path).filter("select 
date(created) as created, count(*) as datalake_count from datalake group by 
date(created)")
   
   My SQL might sound weird, but don't take it word for word; I may want to 
achieve a similar result, but that is not the point. I am just indicating what 
window of data I want from the datalake.
   
   Also, about the partitioning you mentioned: my datalake is partitioned, but 
not by date, and I want data based on dates. Here, created is a timestamp 
value and hence not used as a partition key in my datalake.





Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]

2024-03-12 Thread via GitHub


ad1happy2go commented on issue #10852:
URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991416169

   @jayesh2424 Sorry, but I am not exactly clear on the question. 
   
   In case you are asking how to read a specific part of the table: you can 
read a DataFrame and apply where/filter on it. If your dataset is partitioned 
and you filter on the partition column, then only those subdirectories will be 
read (in case that is what you meant by a part of the data lake).
   
   In case you want to use sql, you can do - 
   df.createOrReplaceTempView("temp_table")
   spark.sql("select * from temp_table where a = 1")





Re: [I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]

2024-03-12 Thread via GitHub


jayesh2424 commented on issue #10852:
URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991371544

   @xushiyan, @ad1happy2go and @codope could you please help me out with this ?





[I] [SUPPORT] Needed a way to load the specific data from the HUDI DATALAKE. [hudi]

2024-03-12 Thread via GitHub


jayesh2424 opened a new issue, #10852:
URL: https://github.com/apache/hudi/issues/10852

   I have a Hudi datalake in AWS. Currently, for ETL operations, I usually do a 
full load of the Hudi datalake. I want to know how I can load only a 
particular set of data from the Hudi datalake. 
   
   What I really want is a method like create_dynamic_frame.from_options(), 
where we send a sampleQuery to the database and fetch only a particular set of 
data. In the same way, I want to send a SQL query to the Hudi datalake.
   
   The main goal: rather than loading the datalake and then filtering, I want 
to filter the datalake and load only that particular part. It would be great 
if I could load this particular data with the help of Spark SQL.





[PR] [HUDI-7493] Consistent naming of Clean configuration parameters [hudi]

2024-03-12 Thread via GitHub


geserdugarov opened a new pull request, #10851:
URL: https://github.com/apache/hudi/pull/10851

   ### Change Logs
   
   `ConfigProperty.key()` and `ConfigOption.key()` are used for docs 
generation, and we need to move towards consistent naming of all parameters in 
Hudi. This MR proposes a first step: changing the names of the Cleaner 
configuration parameters only. The following table shows the proposed changes.
   
   | New main key  | Previous key in `HoodieCleanConfig` | Previous key in 
`FlinkOptions` |
   | - | - | - |
   | hoodie.cleaner.automatic | hoodie.clean.automatic |  |
   | hoodie.cleaner.async.enabled | hoodie.clean.async | clean.async.enabled |
   | hoodie.cleaner.multiple.enabled | hoodie.clean.allow.multiple |  |
   | hoodie.cleaner.incremental.enabled | hoodie.cleaner.incremental.mode |  |
   | hoodie.cleaner.parallelism | hoodie.cleaner.parallelism |  |
   | hoodie.cleaner.policy | hoodie.cleaner.policy | clean.policy |
   | hoodie.cleaner.commits.retained | hoodie.cleaner.commits.retained | 
clean.retain_commits |
   | hoodie.cleaner.fileversions.retained | 
hoodie.cleaner.fileversions.retained | clean.retain_file_versions |
   | hoodie.cleaner.hours.retained | hoodie.cleaner.hours.retained | 
clean.retain_hours |
   | hoodie.cleaner.trigger.strategy | hoodie.clean.trigger.strategy |  |
   | hoodie.cleaner.trigger.max.commits | hoodie.clean.max.commits |  |
   | hoodie.cleaner.delete.bootstrap.base.file | 
hoodie.cleaner.delete.bootstrap.base.file |  |
   | hoodie.cleaner.failed.writes.policy | hoodie.cleaner.policy.failed.writes 
|  |
   
   Currently, we have about 900+ parameters in Hudi, as I mentioned in 
[HUDI-5738](https://issues.apache.org/jira/browse/HUDI-5738). Note that my 
table contains only 899 of them and misses, for instance, the recently added 
TTL parameters.
   
   ### Impact
   
   To preserve backward compatibility, the old keys remain available as 
aliases, implemented via `ConfigProperty.withAlternatives("...")` for 
`HoodieCleanConfig` and `ConfigOption.withFallbackKeys("...")` for 
`FlinkOptions`.
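   The alternative-key mechanism can be pictured as a lookup that tries the new key first and falls back to the legacy ones. The sketch below is only a behavioral illustration in Python, not Hudi's actual `ConfigProperty` implementation:

   ```python
   def resolve_config(props, key, alternatives=(), default=None):
       # Try the primary (new) key first, then each legacy alternative,
       # mirroring the withAlternatives(...) resolution order.
       for k in (key, *alternatives):
           if k in props:
               return props[k]
       return default

   # A user still setting the legacy key keeps working:
   user_props = {"hoodie.clean.async": "true"}
   value = resolve_config(user_props, "hoodie.cleaner.async.enabled",
                          alternatives=("hoodie.clean.async",))
   ```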
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   I will prepare a corresponding MR with updates to the documentation.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7493) Clean configuration for clean service

2024-03-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7493:
-
Labels: pull-request-available  (was: )

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> Sometimes we use `{{{}hoodie.clean.*`{}}}  and sometimes 
> `{{{}hoodie.cleaner.*`.{}}}





[jira] [Commented] (HUDI-7493) Clean configuration for clean service

2024-03-12 Thread Geser Dugarov (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825613#comment-17825613
 ] 

Geser Dugarov commented on HUDI-7493:
-

Could be label by the ["Config Simplification" 
epic|https://issues.apache.org/jira/browse/HUDI-5738].

> Clean configuration for clean service
> -
>
> Key: HUDI-7493
> URL: https://issues.apache.org/jira/browse/HUDI-7493
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>
> Sometimes we use `{{{}hoodie.clean.*`{}}}  and sometimes 
> `{{{}hoodie.cleaner.*`.{}}}





Re: [PR] [HUDI-6806] Support Spark 3.5.0 [hudi]

2024-03-12 Thread via GitHub


ranwani commented on PR #9717:
URL: https://github.com/apache/hudi/pull/9717#issuecomment-1991259976

   @yihua : We need to use Hudi with Spark 3.5. Can you let me know when the 
Hudi 0.15.0 release is planned? 





Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


NishantBaheti commented on issue #10850:
URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991167509

   Hello, I am using this jar 
   - hudi-spark3.3-bundle_2.12-0.14.1.jar
   - spark 3.3
   - hudi 0.14.1





Re: [I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on issue #10850:
URL: https://github.com/apache/hudi/issues/10850#issuecomment-1991109123

   Hi, @NishantBaheti, thanks for your feedback. Could you also share the release versions of Spark and Hudi you are using?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]

2024-03-12 Thread via GitHub


danny0405 commented on code in PR #10727:
URL: https://github.com/apache/hudi/pull/10727#discussion_r1521102072


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##
@@ -202,7 +202,9 @@ private Option> composeSchemaEvolutionTrans
       Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
       Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName());
       boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
-          || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+          && SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()
+          == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+

Review Comment:
   So when the column counts are equal, there is no need to rewrite, regardless of whether the schema is compatible?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]

2024-03-12 Thread via GitHub


xiarixiaoyao commented on code in PR #10727:
URL: https://github.com/apache/hudi/pull/10727#discussion_r1521039170


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##
@@ -202,7 +202,9 @@ private Option> composeSchemaEvolutionTrans
       Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
       Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName());
       boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
-          || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+          || !(SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()

Review Comment:
   The original logic has a performance issue.
   If the read and write schemas are compatible, I think we do not need to rewrite the entire record, since we can read from the old parquet file with the new schema correctly. 
   
   @danny0405 @ThinkerLei 
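A minimal model of the predicate under discussion, written in plain Python for illustration. Here `is_compatible` stands in for the result of Avro's reader/writer compatibility check; this sketches the boolean logic only and is not Hudi's actual implementation.

```python
# Model of the needToReWriteRecord predicate before and after the proposed fix.
# `is_compatible` stands in for Avro's SchemaCompatibility check (assumption
# for illustration); column counts stand in for sameCols/colNamesFromWriteSchema.

def need_rewrite_original(same_cols, write_cols, is_compatible):
    # Pre-fix logic: forces a rewrite even when the schemas ARE compatible.
    return same_cols != write_cols or is_compatible

def need_rewrite_fixed(same_cols, write_cols, is_compatible):
    # Proposed logic: rewrite only when columns differ or schemas are NOT compatible.
    return same_cols != write_cols or not is_compatible

# Same column count, compatible schemas: the old logic forces a rewrite,
# the fixed logic skips it (the performance issue raised in this thread).
print(need_rewrite_original(5, 5, True))  # True
print(need_rewrite_fixed(5, 5, True))     # False
```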



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Incremental query not working on COW table [hudi]

2024-03-12 Thread via GitHub


NishantBaheti opened a new issue, #10850:
URL: https://github.com/apache/hudi/issues/10850

   ## Error
   Error Category: QUERY_ERROR; AnalysisException: Found duplicate column(s) in 
the data schema: `_hoodie_commit_seqno`, `_hoodie_commit_time`, 
`_hoodie_file_name`, `_hoodie_partition_path`, `_hoodie_record_key`
   
   
   ## Code
   hudi_options = {
       'hoodie.datasource.query.type': 'incremental',
       'hoodie.datasource.read.begin.instanttime': start_time,
       'hoodie.datasource.read.end.instanttime': end_time,
   }
   df = spark.read \
       .format("org.apache.hudi") \
       .options(**hudi_options) \
       .load(tablePath)
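A small, hedged sketch of the kind of schema problem the AnalysisException above reports: the same Hudi metadata column names appearing twice in a schema. The column list below is a hypothetical example, not the reporter's actual schema.

```python
# Detect duplicate column names of the kind named in the AnalysisException.
# The column list is a hypothetical example of a schema that ended up with
# Hudi metadata columns twice; it is not taken from the reported table.
from collections import Counter

columns = [
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name", "id", "value",
    "_hoodie_commit_time", "_hoodie_commit_seqno",
]

duplicates = sorted(name for name, count in Counter(columns).items() if count > 1)
print(duplicates)  # ['_hoodie_commit_seqno', '_hoodie_commit_time']
```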


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org