[GitHub] [hudi] hudi-bot commented on pull request #8624: HUDI-6165. Timeline Archival should consider earliest commit to retain from last completed clean

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8624:
URL: https://github.com/apache/hudi/pull/8624#issuecomment-1532534794

   
   ## CI report:
   
   * 476b53ff847fc5f31035df017d49ab5b8d24c8b2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16799)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8624: HUDI-6165. Timeline Archival should consider earliest commit to retain from last completed clean

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8624:
URL: https://github.com/apache/hudi/pull/8624#issuecomment-1532519407

   
   ## CI report:
   
   * 476b53ff847fc5f31035df017d49ab5b8d24c8b2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183258372


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
     // 2) is less than the first commit ts in the timeline
     return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(
+      HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    final Option<String> instantTime = hoodieTable
+        .getMetaClient()
+        .getCommitsTimeline()
+        .filterCompletedInstants()
+        .lastInstant()
+        .map(HoodieInstant::getTimestamp);
+    return partitionLocations.flatMap(p -> {
+      String partitionPath = p.getLeft();
+      String fileId = p.getRight().getFileId();
+      return new HoodieMergedReadHandle(config, instantTime, hoodieTable, Pair.of(partitionPath, fileId))
+          .getMergedRecords().iterator();
+    });
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Option<Pair<String, HoodieRecordLocation>>>> taggedHoodieRecords, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    // completely new records
+    HoodieData<HoodieRecord<R>> newRecords = taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
+    // the records tagged to existing base files
+    HoodieData<HoodieRecord<R>> updatingRecords = taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft)

Review Comment:
   Sounds fair to cache `taggedHoodieRecords` as a whole (the results from the first lookup).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183290174


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java:
##
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.io.storage.HoodieFileReader;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.Schema;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.util.StringUtils.nonEmpty;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+public class HoodieMergedReadHandle<T, I, K, O> extends HoodieReadHandle<T, I, K, O> {
+
+  protected final Schema readerSchema;
+
+  public HoodieMergedReadHandle(HoodieWriteConfig config,
+                                Option<String> instantTime,
+                                HoodieTable<T, I, K, O> hoodieTable,
+                                Pair<String, String> partitionPathFileIDPair) {
+    super(config, instantTime, hoodieTable, partitionPathFileIDPair);
+    readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()), config.allowOperationMetadataField());
+  }
+
+  public List<HoodieRecord<T>> getMergedRecords() {
+    Option<FileSlice> fileSliceOpt = getLatestFileSlice();
+    if (!fileSliceOpt.isPresent()) {
+      return Collections.emptyList();
+    }
+    checkState(nonEmpty(instantTime), String.format("Expected a valid instant time but got `%s`", instantTime));
+    final FileSlice fileSlice = fileSliceOpt.get();
+    final HoodieRecordLocation currentLocation = new HoodieRecordLocation(instantTime, fileSlice.getFileId());
+    Option<HoodieFileReader> baseFileReader = Option.empty();
+    HoodieMergedLogRecordScanner logRecordScanner = null;
+    try {
+      baseFileReader = getBaseFileReader(fileSlice);
+      logRecordScanner = getLogRecordScanner(fileSlice);
+      List<HoodieRecord<T>> mergedRecords = new ArrayList<>();
+      doMergedRead(baseFileReader, logRecordScanner).forEach(r -> {
+        r.unseal();
+        r.setCurrentLocation(currentLocation);
+        r.seal();
+        mergedRecords.add(r);
+      });
+      return mergedRecords;
+    } catch (IOException e) {
+      throw new HoodieIndexException("Error in reading " + fileSlice, e);
+    } finally {
+      if (baseFileReader.isPresent()) {
+        baseFileReader.get().close();
+      }
+      if (logRecordScanner != null) {
+        logRecordScanner.close();
+      }
+    }
+  }
+
+  private Option<FileSlice> getLatestFileSlice() {
+    if (nonEmpty(instantTime)
+        && hoodieTable.getMetaClient().getCommitsTimeline().filterCompletedInstants().lastInstant().isPresent()) {
+      return Option.fromJavaOptional(hoodieTable
+          .getHoodieView()
+          .getLatestMergedFileSlicesBeforeOrOn(partitionPathFileIDPair.getLeft(), instantTime)
+          .filter(fileSlice -> fileSlice.getFileId().equals(partitionPathFileIDPair.getRight()))
+          .findFirst());
+    }
+    return Option.empty();
+  }
+
+  private Option<HoodieFileReader> getBaseFileReader(FileSlice fileSlice) throws IOException {
+    if (fileSlice.getBaseFile().isPresent()) {
+      return Option.of(createNewFileReader(fileSlice.getBaseFile().get()));
+    }
+    return Option.empty();
+  }
+
+  private HoodieMergedLogRecordScanner getLogRecordScanner(FileSlice fileSlice) {
+    List<String> logFilePaths = fileSlice.getLo

[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183288652


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
     // 2) is less than the first commit ts in the timeline
     return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(
+      HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    final Option<String> instantTime = hoodieTable
+        .getMetaClient()
+        .getCommitsTimeline()
+        .filterCompletedInstants()
+        .lastInstant()
+        .map(HoodieInstant::getTimestamp);
+    return partitionLocations.flatMap(p -> {
+      String partitionPath = p.getLeft();
+      String fileId = p.getRight().getFileId();
+      return new HoodieMergedReadHandle(config, instantTime, hoodieTable, Pair.of(partitionPath, fileId))
+          .getMergedRecords().iterator();
+    });
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Option<Pair<String, HoodieRecordLocation>>>> taggedHoodieRecords, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    // completely new records
+    HoodieData<HoodieRecord<R>> newRecords = taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
+    // the records tagged to existing base files
+    HoodieData<HoodieRecord<R>> updatingRecords = taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getRecordKey, config.getGlobalIndexReconcileParallelism());
+    // the tagging partitions and locations
+    HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations = taggedHoodieRecords
+        .filter(p -> p.getRight().isPresent())
+        .map(p -> p.getRight().get())
+        .distinct(config.getGlobalIndexReconcileParallelism());
+    // merged existing records with current locations being set
+    HoodieData<HoodieRecord<R>> existingRecords = getTaggedRecordsFromPartitionLocations(partitionLocations, config, hoodieTable);
+
+    TypedProperties updatedProps = HoodieAvroRecordMerger.Config.withLegacyOperatingModePreCombining(config.getProps());
+    HoodieData<HoodieRecord<R>> taggedUpdatingRecords = updatingRecords.mapToPair(r -> Pair.of(r.getRecordKey(), r))
+        .leftOuterJoin(existingRecords.mapToPair(r -> Pair.of(r.getRecordKey(), r)))
+        .values().flatMap(entry -> {
+          HoodieRecord<R> incoming = entry.getLeft();
+          Option<HoodieRecord<R>> existingOpt = entry.getRight();
+          if (!existingOpt.isPresent()) {
+            // existing record not found (e.g., due to delete log not merged to base file): tag as a new record
+            return Collections.singletonList(getTaggedRecord(incoming, Option.empty())).iterator();
+          }
+          HoodieRecord<R> existing = existingOpt.get();
+          if (incoming.getData() instanceof EmptyHoodieRecordPayload) {
+            // incoming is a delete: force tag the incoming to the old partition
+            return Collections.singletonList(getTaggedRecord(incoming, Option.of(existing.getCurrentLocation()))).iterator();
+          }
+          Schema existingSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()), config.allowOperationMetadataField());

Review Comment:
   Separating those cases out does not help reuse, as it's only used here. It's actually easier to see all `return` cases in one block to get the full picture.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183282840


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
     // 2) is less than the first commit ts in the timeline
     return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(

Review Comment:
   OK, sounds fair to call it `getExistingRecords()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6165) Timeline Archival should consider earliest commit to retain from last completed clean

2023-05-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-6165:
-
Labels: pull-request-available  (was: )

> Timeline Archival should consider earliest commit to retain from last 
> completed clean
> -
>
> Key: HUDI-6165
> URL: https://issues.apache.org/jira/browse/HUDI-6165
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Timeline archival should also consider the earliest commit to retain from the 
> last completed clean instant. Currently it only considers the earliest commit 
> to retain of the first pending clean instant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] lokeshj1703 opened a new pull request, #8624: HUDI-6165. Timeline Archival should consider earliest commit to retain from last completed clean

2023-05-02 Thread via GitHub


lokeshj1703 opened a new pull request, #8624:
URL: https://github.com/apache/hudi/pull/8624

   ### Change Logs
   
   Timeline archival should also consider the earliest commit to retain from the last completed clean instant; currently it only considers the earliest commit to retain of the first pending clean instant. A minimal sketch of the idea follows.
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6165) Timeline Archival should consider earliest commit to retain from last completed clean

2023-05-02 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-6165:
-

 Summary: Timeline Archival should consider earliest commit to 
retain from last completed clean
 Key: HUDI-6165
 URL: https://issues.apache.org/jira/browse/HUDI-6165
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


Timeline archival should also consider the earliest commit to retain from the last 
completed clean instant. Currently it only considers the earliest commit to retain 
of the first pending clean instant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on pull request #8107: [HUDI-5514][HUDI-5574][HUDI-5604][HUDI-5535] Adding auto generation of record keys support to Hudi/Spark

2023-05-02 Thread via GitHub


danny0405 commented on PR #8107:
URL: https://github.com/apache/hudi/pull/8107#issuecomment-1532483205

   > Need closure on wrapping the key generator impl as @danny0405 was 
suggesting, among other things.
   > 
   > @danny0405 we need to do this for Flink as well. thoughts?
   
   Flink already implements the keyless use case through a wholly different code path; it already wraps the key generator in the current code base.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bvaradar commented on pull request #8452: [HUDI-6077] Add more partition push down filters

2023-05-02 Thread via GitHub


bvaradar commented on PR #8452:
URL: https://github.com/apache/hudi/pull/8452#issuecomment-1532481200

   @boneanxs: Can you also look at the failing test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bvaradar commented on a diff in pull request #8452: [HUDI-6077] Add more partition push down filters

2023-05-02 Thread via GitHub


bvaradar commented on code in PR #8452:
URL: https://github.com/apache/hudi/pull/8452#discussion_r1183257664


##
hudi-common/src/main/java/org/apache/hudi/expression/Predicates.java:
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.expression;
+
+import org.apache.hudi.internal.schema.Type;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+public class Predicates {
+
+  public static True alwaysTrue() {
+return True.get();
+  }
+
+  public static False alwaysFalse() {
+return False.get();
+  }
+
+  public static And and(Expression left, Expression right) {
+return new And(left, right);
+  }
+
+  public static Or or(Expression left, Expression right) {
+return new Or(left, right);
+  }
+
+  public static BinaryComparison gt(Expression left, Expression right) {
+return new BinaryComparison(left, Expression.Operator.GT, right);
+  }
+
+  public static BinaryComparison lt(Expression left, Expression right) {
+return new BinaryComparison(left, Expression.Operator.LT, right);
+  }
+
+  public static BinaryComparison eq(Expression left, Expression right) {
+return new BinaryComparison(left, Expression.Operator.EQ, right);
+  }
+
+  public static BinaryComparison gteq(Expression left, Expression right) {
+return new BinaryComparison(left, Expression.Operator.GT_EQ, right);
+  }
+
+  public static BinaryComparison lteq(Expression left, Expression right) {
+return new BinaryComparison(left, Expression.Operator.LT_EQ, right);
+  }
+
+  public static BinaryComparison startsWith(Expression left, Expression right) {
+    return new BinaryComparison(left, Expression.Operator.STARTS_WITH, right);
+  }
+
+  public static StringContains contains(Expression left, Expression right) {
+return new StringContains(left, right);
+  }
+
+  public static In in(Expression left, List<Expression> validExpressions) {
+    return new In(left, validExpressions);
+  }
+
+  public static IsNull isNull(Expression child) {
+return new IsNull(child);
+  }
+
+  public static IsNotNull isNotNull(Expression child) {
+return new IsNotNull(child);
+  }
+
+  public static Not not(Expression expr) {
+return new Not(expr);
+  }
+
+  public static class True extends LeafExpression implements Predicate {

Review Comment:
   Rename `True` to `TrueExpression`, as it is very similar to the Boolean type. Same for `False`.



##
hudi-common/src/main/java/org/apache/hudi/expression/ArrayData.java:
##
@@ -16,22 +16,25 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.hive.expression;

Review Comment:
   Any specific reason why we are renaming packages ?



##
hudi-common/src/main/java/org/apache/hudi/expression/BindVisitor.java:
##
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.expression;
+
+import org.apache.hudi.internal.schema.Types;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class BindVisitor implements ExpressionVisitor<Expression> {
+
+  protected final Types.RecordType recordType;
+  protected final boolean caseSensitive;
+
+  public BindVisitor(Types.RecordType recordType, boolean caseSensitive) {
+this.recordType = recordType;
+this.caseSensitive = caseSensitive;
+  }
+
+  @Override
+

[GitHub] [hudi] danny0405 commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-05-02 Thread via GitHub


danny0405 commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1532480189

   Okay, I finally found the reason for the failure of test `TestNestedSchemaPruningOptimization`.
   
   It is because we hard-code the Parquet vectorized reader in the base relation: https://github.com/apache/hudi/blob/620f39a5fd5e1392819d530ea963f866c3f1c301/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala#L78 in https://github.com/apache/hudi/pull/5168.
   
   After we upgraded to Spark 3.3.2, whole-stage codegen is triggered in the Spark physical plan, and whole-stage codegen generates code based on a per-row assumption (no vectorized reader supported).
   
   I have created a patch to fix this (it also fixes the compile error for the hudi-sync module). The patch disables the vectorized reader, but I'm not sure how it would impact performance; would ping @xiarixiaoyao for a review ~
   
   [5868.patch.zip](https://github.com/apache/hudi/files/11379613/5868.patch.zip)
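   For reference, a minimal standalone illustration (not part of the patch): `spark.sql.parquet.enableVectorizedReader` is a real Spark SQL conf, and turning it off forces the row-based Parquet reader that whole-stage codegen assumes. The app name and path below are placeholders.
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class RowReaderDemo {
     public static void main(String[] args) {
       // Disable the vectorized (batch) Parquet reader for the whole session so
       // scans produce rows one at a time, matching whole-stage codegen's model.
       SparkSession spark = SparkSession.builder()
           .appName("parquet-row-reader-demo")
           .master("local[*]")
           .config("spark.sql.parquet.enableVectorizedReader", "false")
           .getOrCreate();
       spark.read().parquet("/tmp/some_parquet_table").show(5);
       spark.stop();
     }
   }
   ```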
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183253432


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
     // 2) is less than the first commit ts in the timeline
     return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(
+      HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    final Option<String> instantTime = hoodieTable
+        .getMetaClient()
+        .getCommitsTimeline()
+        .filterCompletedInstants()
+        .lastInstant()
+        .map(HoodieInstant::getTimestamp);
+    return partitionLocations.flatMap(p -> {
+      String partitionPath = p.getLeft();
+      String fileId = p.getRight().getFileId();
+      return new HoodieMergedReadHandle(config, instantTime, hoodieTable, Pair.of(partitionPath, fileId))
+          .getMergedRecords().iterator();
+    });
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
+      HoodieData<Pair<HoodieRecord<R>, Option<Pair<String, HoodieRecordLocation>>>> taggedHoodieRecords, HoodieWriteConfig config, HoodieTable hoodieTable) {
+    // completely new records
+    HoodieData<HoodieRecord<R>> newRecords = taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
+    // the records tagged to existing base files
+    HoodieData<HoodieRecord<R>> updatingRecords = taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft)
+        .distinctWithKey(HoodieRecord::getRecordKey, config.getGlobalIndexReconcileParallelism());

Review Comment:
   The tagged records at this point will contain duplicates when the last write updated a record's partition (inserting a new copy into the new partition) and compaction has not happened yet. The first lookup will still get 2 records because it joins only with base files.
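   A toy sketch of the scenario (plain Java streams rather than Hudi's `HoodieData`, with made-up keys and partitions) showing why the dedup by record key is needed:
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   class GlobalIndexDupSketch {
     public static void main(String[] args) {
       // Key k1 moved from partition p1 to p2; until compaction folds the delete
       // log into p1's base file, a base-file-only lookup returns both copies.
       List<String[]> tagged = Arrays.asList(
           new String[] {"k1", "p1"},  // stale copy in the old partition's base file
           new String[] {"k1", "p2"}); // current copy in the new partition
       Map<String, String[]> dedupedByKey = tagged.stream()
           .collect(Collectors.toMap(r -> r[0], r -> r, (a, b) -> b)); // keep one per key
       System.out.println(dedupedByKey.size()); // 1
     }
   }
   ```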



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183252279


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java:
##
@@ -110,33 +110,31 @@ protected <R> HoodieData<HoodieRecord<R>> tagLocationBacktoRecords(
         keyLocationPairs.mapToPair(p -> new ImmutablePair<>(
             p.getKey().getRecordKey(), new ImmutablePair<>(p.getValue(), p.getKey())));
 
+    // Pair of a tagged record and whether the record needs dedup

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Mulavar commented on a diff in pull request #8385: [HUDI-6040]Stop writing and reading compaction plans from .aux folder

2023-05-02 Thread via GitHub


Mulavar commented on code in PR #8385:
URL: https://github.com/apache/hudi/pull/8385#discussion_r1183246091


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Downgrade handle to assist in downgrading a hoodie table from version 6 to 5.
+ * To ensure compatibility, we need to recreate the compaction requested file in
+ * the .aux folder.
+ */
+public class SixToFiveDowngradeHandler implements DowngradeHandler {
+
+  @Override
+  public Map<ConfigProperty, String> downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) {
+    HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
+    HoodieTableMetaClient metaClient = table.getMetaClient();
+    // sync compaction requested file to .aux
+    HoodieTimeline compactionTimeline = new HoodieActiveTimeline(metaClient, false).filterPendingCompactionTimeline()
+        .filter(instant -> instant.getState() == HoodieInstant.State.REQUESTED);

Review Comment:
   We have fixed this problem by setting `applyLayoutFilters` to false when creating the HoodieActiveTimeline, so that filterPendingCompactionTimeline() returns both inflight and requested instants.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8385: [HUDI-6040]Stop writing and reading compaction plans from .aux folder

2023-05-02 Thread via GitHub


nsivabalan commented on code in PR #8385:
URL: https://github.com/apache/hudi/pull/8385#discussion_r1183240534


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SixToFiveDowngradeHandler.java:
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.upgrade;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.Collections;
+import java.util.Map;
+
+/**
+ * Downgrade handle to assist in downgrading a hoodie table from version 6 to 5.
+ * To ensure compatibility, we need to recreate the compaction requested file in
+ * the .aux folder.
+ */
+public class SixToFiveDowngradeHandler implements DowngradeHandler {
+
+  @Override
+  public Map<ConfigProperty, String> downgrade(HoodieWriteConfig config, HoodieEngineContext context, String instantTime, SupportsUpgradeDowngrade upgradeDowngradeHelper) {
+    HoodieTable table = upgradeDowngradeHelper.getTable(config, context);
+    HoodieTableMetaClient metaClient = table.getMetaClient();
+    // sync compaction requested file to .aux
+    HoodieTimeline compactionTimeline = new HoodieActiveTimeline(metaClient, false).filterPendingCompactionTimeline()
+        .filter(instant -> instant.getState() == HoodieInstant.State.REQUESTED);

Review Comment:
   Did we follow up on this? i.e., if we have a compaction instant that is inflight, does filterPendingCompactionTimeline() return both the inflight and the requested instant, or just the inflight one?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6164) Create new version for RawTripTestPayload to avoid misuse

2023-05-02 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-6164:


 Summary: Create new version for RawTripTestPayload to avoid misuse
 Key: HUDI-6164
 URL: https://issues.apache.org/jira/browse/HUDI-6164
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Raymond Xu


org.apache.hudi.common.testutils.HoodieTestDataGenerator has been omitting the 
ordering value when creating RawTripTestPayload; as a result, the records 
generated are not usable for merging. This involves these 2 constructors:


{code:java}
org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(org.apache.hudi.common.util.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.Boolean, java.lang.Comparable)
org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(java.lang.String, java.lang.String, java.lang.String, java.lang.String)
{code}


On the other hand, there are test cases that construct RawTripTestPayload with json 
data directly and fix the partition field as `time`, using this constructor:

{code:java}
org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(java.lang.String)
{code}

These are contradictory usages of this class. We should create another payload 
class for the 2nd use case (fixed simple schema), and make RawTripTestPayload 
support setting the ordering value with HoodieTestDataGenerator.
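A quick sketch of the two usages side by side (the json literal and schema string 
below are placeholders, not values from the test suite):

{code:java}
import org.apache.hudi.common.testutils.RawTripTestPayload;
import org.apache.hudi.common.util.Option;

import java.io.IOException;

class RawTripPayloadUsages {
  static void demo(String schemaStr) throws IOException {
    String json = "{\"_row_key\": \"key1\", \"time\": \"2023-05-02T00:00:00\", \"number\": 1}";
    // Usage 1: the full constructor. HoodieTestDataGenerator currently omits the
    // ordering value here, so records built this way cannot exercise merging.
    RawTripTestPayload withOrdering =
        new RawTripTestPayload(Option.of(json), "key1", "2023/05/02", schemaStr, false, 1);
    // Usage 2: the json-only constructor; the partition path is always derived
    // from the fixed "time" field.
    RawTripTestPayload jsonOnly = new RawTripTestPayload(json);
  }
}
{code}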





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6164) Create new version for RawTripTestPayload to avoid misuse

2023-05-02 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-6164:
-
Sprint: Sprint 2023-04-10

> Create new version for RawTripTestPayload to avoid misuse
> -
>
> Key: HUDI-6164
> URL: https://issues.apache.org/jira/browse/HUDI-6164
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Raymond Xu
>Priority: Major
>
> org.apache.hudi.common.testutils.HoodieTestDataGenerator has been omitting the 
> ordering value when creating RawTripTestPayload; as a result, the records 
> generated are not usable for merging. This involves these 2 constructors:
> {code:java}
> org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(org.apache.hudi.common.util.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.Boolean, java.lang.Comparable)
> org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(java.lang.String, java.lang.String, java.lang.String, java.lang.String)
> {code}
> On the other hand, there are test cases that construct RawTripTestPayload with 
> json data directly and fix the partition field as `time`, using this constructor:
> {code:java}
> org.apache.hudi.common.testutils.RawTripTestPayload#RawTripTestPayload(java.lang.String)
> {code}
> These are contradictory usages of this class. We should create another payload 
> class for the 2nd use case (fixed simple schema), and make RawTripTestPayload 
> support setting the ordering value with HoodieTestDataGenerator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


xushiyan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183239844


##
hudi-common/src/test/java/org/apache/hudi/common/testutils/RawTripTestPayload.java:
##
@@ -76,8 +85,29 @@ public RawTripTestPayload(String jsonData) throws IOException {
     this.dataSize = jsonData.length();
     Map jsonRecordMap = OBJECT_MAPPER.readValue(jsonData, Map.class);
     this.rowKey = jsonRecordMap.get("_row_key").toString();
-    this.partitionPath = jsonRecordMap.get("time").toString().split("T")[0].replace("-", "/");
+    this.partitionPath = extractPartitionFromTimeField(jsonRecordMap.get("time").toString());
     this.isDeleted = false;
+    this.orderingVal = Integer.valueOf(jsonRecordMap.getOrDefault("number", 0L).toString());
+  }
+
+  public RawTripTestPayload(GenericRecord record, Comparable orderingVal) {

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-6164



##
hudi-common/src/test/java/org/apache/hudi/common/testutils/RawTripTestPayload.java:
##
@@ -86,7 +87,7 @@ public RawTripTestPayload(String jsonData) throws IOException {
     this.rowKey = jsonRecordMap.get("_row_key").toString();
     this.partitionPath = extractPartitionFromTimeField(jsonRecordMap.get("time").toString());
     this.isDeleted = false;
-    this.orderingVal = Integer.valueOf(jsonRecordMap.get("number").toString());
+    this.orderingVal = Integer.valueOf(jsonRecordMap.getOrDefault("number", 0L).toString());

Review Comment:
   https://issues.apache.org/jira/browse/HUDI-6164



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


nsivabalan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183226990


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java:
##
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordMerger;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.io.storage.HoodieFileReader;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.Schema;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import static java.util.stream.Collectors.toList;
+import static org.apache.hudi.common.util.StringUtils.nonEmpty;
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+public class HoodieMergedReadHandle<T, I, K, O> extends HoodieReadHandle<T, I, K, O> {
+
+  protected final Schema readerSchema;
+
+  public HoodieMergedReadHandle(HoodieWriteConfig config,
+                                Option<String> instantTime,
+                                HoodieTable<T, I, K, O> hoodieTable,
+                                Pair<String, String> partitionPathFileIDPair) {
+    super(config, instantTime, hoodieTable, partitionPathFileIDPair);
+    readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()), config.allowOperationMetadataField());
+  }
+
+  public List<HoodieRecord<T>> getMergedRecords() {
+    Option<FileSlice> fileSliceOpt = getLatestFileSlice();
+    if (!fileSliceOpt.isPresent()) {
+      return Collections.emptyList();
+    }
+    checkState(nonEmpty(instantTime), String.format("Expected a valid instant time but got `%s`", instantTime));
+    final FileSlice fileSlice = fileSliceOpt.get();
+    final HoodieRecordLocation currentLocation = new HoodieRecordLocation(instantTime, fileSlice.getFileId());
+    Option<HoodieFileReader> baseFileReader = Option.empty();
+    HoodieMergedLogRecordScanner logRecordScanner = null;
+    try {
+      baseFileReader = getBaseFileReader(fileSlice);
+      logRecordScanner = getLogRecordScanner(fileSlice);
+      List<HoodieRecord<T>> mergedRecords = new ArrayList<>();
+      doMergedRead(baseFileReader, logRecordScanner).forEach(r -> {
+        r.unseal();
+        r.setCurrentLocation(currentLocation);
+        r.seal();
+        mergedRecords.add(r);
+      });
+      return mergedRecords;
+    } catch (IOException e) {
+      throw new HoodieIndexException("Error in reading " + fileSlice, e);
+    } finally {
+      if (baseFileReader.isPresent()) {
+        baseFileReader.get().close();
+      }
+      if (logRecordScanner != null) {
+        logRecordScanner.close();
+      }
+    }
+  }
+
+  private Option<FileSlice> getLatestFileSlice() {
+    if (nonEmpty(instantTime)
+        && hoodieTable.getMetaClient().getCommitsTimeline().filterCompletedInstants().lastInstant().isPresent()) {
+      return Option.fromJavaOptional(hoodieTable
+          .getHoodieView()
+          .getLatestMergedFileSlicesBeforeOrOn(partitionPathFileIDPair.getLeft(), instantTime)
+          .filter(fileSlice -> fileSlice.getFileId().equals(partitionPathFileIDPair.getRight()))
+          .findFirst());
+    }
+    return Option.empty();
+  }
+
+  private Option<HoodieFileReader> getBaseFileReader(FileSlice fileSlice) throws IOException {
+    if (fileSlice.getBaseFile().isPresent()) {
+      return Option.of(createNewFileReader(fileSlice.getBaseFile().get()));
+    }
+    return Option.empty();
+  }
+
+  private HoodieMergedLogRecordScanner getLogRecordScanner(FileSlice fileSlice) {
+    List<String> logFilePaths = fileSlice.get

[GitHub] [hudi] bvaradar commented on pull request #7922: [HUDI-5578] Upgrade base docker image for java 8

2023-05-02 Thread via GitHub


bvaradar commented on PR #7922:
URL: https://github.com/apache/hudi/pull/7922#issuecomment-1532430453

   @kazdy: Is this PR still required?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #8609: [HUDI-6154] Introduced rety while reading hoodie.properties to deal with parallel updates.

2023-05-02 Thread via GitHub


CTTY commented on code in PR #8609:
URL: https://github.com/apache/hudi/pull/8609#discussion_r1183212621


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -323,22 +326,43 @@ public HoodieTableConfig() {
     super();
   }
 
-  private void fetchConfigs(FileSystem fs, String metaPath) throws IOException {
+  private static TypedProperties fetchConfigs(FileSystem fs, String metaPath) throws IOException {
     Path cfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE);
-    try (FSDataInputStream is = fs.open(cfgPath)) {
-      props.load(is);
-    } catch (IOException ioe) {
-      if (!fs.exists(cfgPath)) {
-        LOG.warn("Run `table recover-configs` if config update/delete failed midway. Falling back to backed up configs.");
-        // try the backup. this way no query ever fails if update fails midway.
-        Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP);
-        try (FSDataInputStream is = fs.open(backupCfgPath)) {
+    Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP);
+    int readRetryCount = 0;
+    boolean found = false;
+
+    TypedProperties props = new TypedProperties();
+    while (readRetryCount++ < MAX_READ_RETRIES) {
+      for (Path path : Arrays.asList(cfgPath, backupCfgPath)) {

Review Comment:
   Using a list and return here is much cleaner, brilliant



##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -323,22 +326,43 @@ public HoodieTableConfig() {
     super();
   }
 
-  private void fetchConfigs(FileSystem fs, String metaPath) throws IOException {
+  private static TypedProperties fetchConfigs(FileSystem fs, String metaPath) throws IOException {
     Path cfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE);
-    try (FSDataInputStream is = fs.open(cfgPath)) {
-      props.load(is);
-    } catch (IOException ioe) {
-      if (!fs.exists(cfgPath)) {
-        LOG.warn("Run `table recover-configs` if config update/delete failed midway. Falling back to backed up configs.");
-        // try the backup. this way no query ever fails if update fails midway.
-        Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP);
-        try (FSDataInputStream is = fs.open(backupCfgPath)) {
+    Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP);
+    int readRetryCount = 0;
+    boolean found = false;
+
+    TypedProperties props = new TypedProperties();
+    while (readRetryCount++ < MAX_READ_RETRIES) {
+      for (Path path : Arrays.asList(cfgPath, backupCfgPath)) {
+        // Read the properties and validate that it is a valid file
+        try (FSDataInputStream is = fs.open(path)) {
+          props.clear();
           props.load(is);
+          found = true;
+          ValidationUtils.checkArgument(props.containsKey(TYPE) && props.containsKey(NAME));
+          return props;
+        } catch (IOException e) {
+          LOG.warn(String.format("Could not read properties from %s: %s", path, e));
+        } catch (IllegalArgumentException e) {
+          LOG.warn(String.format("Invalid properties file %s: %s", path, props));
         }
-      } else {
-        throw ioe;
+      }
+
+      // Failed to read all files so wait before retrying. This can happen in cases of parallel updates to the properties.
+      try {
+        Thread.sleep(READ_RETRY_DELAY_MSEC);
+      } catch (InterruptedException e) {
+        LOG.warn("Interrupted while waiting");
       }
     }
+
+    // If we are here, then after all retries either no hoodie.properties was found or only an invalid file was found.
+    if (found) {
+      throw new IllegalArgumentException("hoodie.properties file seems invalid. Please check for left over `.updated` files if any, manually copy it to hoodie.properties and retry");
+    } else {
+      throw new HoodieIOException("Failed to read hoodie properties");
+    }

Review Comment:
   If `IllegalArgumentException` can be handled earlier, then we won't have to 
have this logic here as well



##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -323,22 +326,43 @@ public HoodieTableConfig() {
 super();
   }
 
-  private void fetchConfigs(FileSystem fs, String metaPath) throws IOException 
{
+  private static TypedProperties fetchConfigs(FileSystem fs, String metaPath) 
throws IOException {
 Path cfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE);
-try (FSDataInputStream is = fs.open(cfgPath)) {
-  props.load(is);
-} catch (IOException ioe) {
-  if (!fs.exists(cfgPath)) {
-LOG.warn("Run `table recover-configs` if config update/delete failed 
midway. Falling back to backed up configs.");
-// try the backup. this way no query ever fails if update fails midway.
-Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP);
-try (FSDataInputStream is = fs.open(backupCf

[GitHub] [hudi] danny0405 commented on a diff in pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.

2023-05-02 Thread via GitHub


danny0405 commented on code in PR #8604:
URL: https://github.com/apache/hudi/pull/8604#discussion_r1183210983


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##
@@ -161,27 +161,28 @@ protected void commit(String instantTime, Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionRecordsMap,
     Option<HoodieInstant> alreadyCompletedInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry -> entry.getTimestamp().equals(instantTime)).lastInstant();
-    if (alreadyCompletedInstant.isPresent()) {
-      // this code path refers to a re-attempted commit that got committed to metadata table, but failed in datatable.
-      // for eg, lets say compaction c1 on 1st attempt succeeded in metadata table and failed before committing to datatable.
-      // when retried again, data table will first rollback pending compaction. these will be applied to metadata table, but all changes
-      // are upserts to metadata table and so only a new delta commit will be created.
-      // once rollback is complete, compaction will be retried again, which will eventually hit this code block where the respective commit is
-      // already part of completed commit. So, we have to manually remove the completed instant and proceed.
-      // and it is for the same reason we enabled withAllowMultiWriteOnSameInstant for metadata table.
-      HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
-      metadataMetaClient.reloadActiveTimeline();
+    LOG.info(String.format("%s completed commit at %s being applied to metadata table",
+        alreadyCompletedInstant.isPresent() ? "Already" : "Partially", instantTime));
+
+    // Rollback the previously committed commit
+    if (!writeClient.rollback(instantTime)) {

Review Comment:
   Yeah, fixing the rollback in sync with the normal DT can avoid many potential bugs; +1 for this direction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #8606: [MINOR] Check the return value from delete during rollback and finalize to ensure the files actually got deleted.

2023-05-02 Thread via GitHub


danny0405 commented on PR #8606:
URL: https://github.com/apache/hudi/pull/8606#issuecomment-1532407484

   Can you help check the test failure?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #8617: [SUPPORT] MapType support in HUDI

2023-05-02 Thread via GitHub


danny0405 commented on issue #8617:
URL: https://github.com/apache/hudi/issues/8617#issuecomment-1532405416

   You can use the COW table type, which writes only Parquet files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #7826: [HUDI-5675] fix lazy clean schedule rollback on completed instant

2023-05-02 Thread via GitHub


stream2000 commented on code in PR #7826:
URL: https://github.com/apache/hudi/pull/7826#discussion_r1183190244


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java:
##
@@ -707,20 +709,34 @@ protected List<String> getInstantsToRollback(HoodieTableMetaClient metaClient, H
 }
   }).collect(Collectors.toList());
 } else if (cleaningPolicy.isLazy()) {
-  return inflightInstantsStream.filter(instant -> {
-try {
-  return heartbeatClient.isHeartbeatExpired(instant.getTimestamp());
-} catch (IOException io) {
-  throw new HoodieException("Failed to check heartbeat for instant " + instant, io);
-}
-  }).map(HoodieInstant::getTimestamp).collect(Collectors.toList());
+  return getInstantsToRollbackForLazyCleanPolicy(metaClient, inflightInstantsStream);
 } else if (cleaningPolicy.isNever()) {
   return Collections.emptyList();
 } else {
   throw new IllegalArgumentException("Invalid Failed Writes Cleaning Policy " + config.getFailedWritesCleanPolicy());
 }
   }
 
+  @VisibleForTesting
+  public List<String> getInstantsToRollbackForLazyCleanPolicy(HoodieTableMetaClient metaClient,
+  Stream<HoodieInstant> inflightInstantsStream) {
+// Get expired instants, must store them into a list before double-checking
+List<HoodieInstant> expiredInstants = inflightInstantsStream.filter(instant -> {
+  try {
+// An instant transformed from inflight to completed has no heartbeat file and will be detected as an expired instant here
+return heartbeatClient.isHeartbeatExpired(instant.getTimestamp());
+  } catch (IOException io) {
+throw new HoodieException("Failed to check heartbeat for instant " + instant, io);
+  }
+}).collect(Collectors.toList());
+
+// Only return instants that haven't been completed by other writers
+return expiredInstants.stream()
+.filter(instant -> !metaClient.getActiveTimeline().isCompletedCommitFileExists(instant))

Review Comment:
   Agreed. Lazy clean is usually an async operation, and reloading the timeline will not cost too much time if archiving works as expected. We don't need to introduce a new API `isCompletedCommitFileExists` here just to reduce the cost of reloading the timeline in lazy clean.
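   For reference, a minimal sketch of the reload-based double check, assuming the usual `reloadActiveTimeline`/`containsInstant` timeline APIs; the helper name is illustrative:

   ```java
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;

   import java.util.List;
   import java.util.stream.Collectors;

   // Illustrative only: re-check heartbeat-expired instants against a freshly
   // reloaded timeline instead of adding a dedicated isCompletedCommitFileExists API.
   static List<String> rollbackCandidates(HoodieTableMetaClient metaClient, List<String> expiredInstants) {
     HoodieTimeline completed = metaClient.reloadActiveTimeline().filterCompletedInstants();
     return expiredInstants.stream()
         // drop instants that another writer completed while we were checking heartbeats
         .filter(ts -> !completed.containsInstant(ts))
         .collect(Collectors.toList());
   }
   ```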



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

2023-05-02 Thread via GitHub


danny0405 commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1532370377

   > but I encountered an issue where I was unable to read the data back and 
received a null pointer exception
   
   This patch should solve your issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #8190: [HUDI-5936] Fix serialization problem when FileStatus is not serializable

2023-05-02 Thread via GitHub


CTTY commented on code in PR #8190:
URL: https://github.com/apache/hudi/pull/8190#discussion_r1183184527


##
hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java:
##
@@ -106,9 +106,9 @@ private List<String> getPartitionPathWithPathPrefix(String relativePathPrefix) t
   int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size());
 
   // List all directories in parallel
-  List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
+  List<Pair<Path, Boolean>> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
 FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
-return Arrays.stream(fileSystem.listStatus(path));
+return Arrays.stream(fileSystem.listStatus(path)).map(fileStatus -> Pair.of(fileStatus.getPath(), fileStatus.isDirectory()));
   }, listingParallelism);

Review Comment:
   Yes, I'm from EMR and we reviewed EMRFS with its owners. It would be very tricky to fix this at the FS level, and it makes more sense to fix it within Hudi to ensure the objects are serializable.
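   For readers following the thread, the pattern in the diff above restated with comments; the note on Path is an assumption based on recent Hadoop releases:

   ```java
   // Ship only types that serialize cleanly across engine boundaries: Path is
   // Serializable in recent Hadoop releases and boolean is primitive, whereas
   // the concrete FileStatus subclass returned by some file systems is not.
   List<Pair<Path, Boolean>> dirToFileListing = engineContext.flatMap(pathsToList, path -> {
     FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
     return Arrays.stream(fileSystem.listStatus(path))
         .map(status -> Pair.of(status.getPath(), status.isDirectory()));
   }, listingParallelism);
   ```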



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #8514: [HUDI-6113] Support multiple transformers using the same config keys in DeltaStreamer

2023-05-02 Thread via GitHub


nsivabalan commented on code in PR #8514:
URL: https://github.com/apache/hudi/pull/8514#discussion_r1183180023


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:
##
@@ -276,7 +276,17 @@ public static class Config implements Serializable {
 + ". Allows transforming raw source Dataset to a target Dataset (conforming to target schema) before "
 + "writing. Default : Not set. E:g - org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which "
 + "allows a SQL query templated to be passed as a transformation function). "
-+ "Pass a comma-separated list of subclass names to chain the transformations.")
++ "Pass a comma-separated list of subclass names to chain the transformations. Transformer can also include "
++ "an identifier. E:g - tr1:org.apache.hudi.utilities.transform.SqlQueryBasedTransformer. Here the identifier tr1 "
++ "can be used along with property key like `hoodie.deltastreamer.transformer.sql.tr1` to identify properties related "
++ "to the transformer. So effective value for `hoodie.deltastreamer.transformer.sql` is determined by key "
++ "`hoodie.deltastreamer.transformer.sql.tr1` for this transformer. This is useful when there are two or more "
++ "transformers using the same config keys and expect different values for those keys. If identifier is used, it should "
++ "be specified for all the transformers. Further the order in which transformer is applied is determined by the occurrence "
++ "of transformer irrespective of the identifier used for the transformer. For example: In the configured value below "
++ "tr2:org.apache.hudi.utilities.transform.SqlQueryBasedTransformer,tr1:org.apache.hudi.utilities.transform.SqlQueryBasedTransformer "
++ ", tr2 is applied before tr1 based on order of occurrence."
+)

Review Comment:
   Can we call out that this identifier format is not strictly required unless users need multiple transformers of the same type?
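   For illustration, a hypothetical setup with two identified transformers, following the description in the diff above (class names, identifiers, and SQL are examples only):

   ```
   --transformer-class tr1:org.apache.hudi.utilities.transform.SqlQueryBasedTransformer,tr2:org.apache.hudi.utilities.transform.SqlQueryBasedTransformer

   # The same base key resolves to a different value per identifier;
   # tr1 is applied before tr2 because it occurs first in the list.
   hoodie.deltastreamer.transformer.sql.tr1=SELECT * FROM <SRC> WHERE amount > 0
   hoodie.deltastreamer.transformer.sql.tr2=SELECT *, upper(currency) AS currency_code FROM <SRC>
   ```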



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java:
##
@@ -19,36 +19,137 @@
 package org.apache.hudi.utilities.transform;
 
 import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
 
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
 
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.HashSet;
 import java.util.List;
+import java.util.Map;
+import java.util.Set;
 import java.util.stream.Collectors;
 
 /**
  * A {@link Transformer} to chain other {@link Transformer}s and apply sequentially.
  */
 public class ChainedTransformer implements Transformer {
 
-  private List<Transformer> transformers;
+  // Delimiter used to separate class name and the property key suffix. The suffix comes first.
+  private static final String TRANSFORMER_CLASS_NAME_ID_DELIMITER = ":";
 
-  public ChainedTransformer(List<Transformer> transformers) {
-this.transformers = transformers;
+  private final List<TransformerInfo> transformers;
+
+  public ChainedTransformer(List<Transformer> transformersList) {
+this.transformers = new ArrayList<>(transformersList.size());
+for (Transformer transformer : transformersList) {
+  this.transformers.add(new TransformerInfo(transformer));
+}
+  }
+
+  /**
+   * Creates a chained transformer using the input transformer class names. Refer {@link HoodieDeltaStreamer.Config#transformerClassNames}
+   * for more information on how the transformers can be configured.
+   *
+   * @param configuredTransformers List of configured transformer class names.
+   * @param ignore Added for avoiding two methods with same erasure. Ignored.
+   */
+  public ChainedTransformer(List<String> configuredTransformers, int... ignore) {
+this.transformers = new ArrayList<>(configuredTransformers.size());
+
+Set<String> identifiers = new HashSet<>();
+for (String configuredTransformer : configuredTransformers) {
+  if (!configuredTransformer.contains(TRANSFORMER_CLASS_NAME_ID_DELIMITER)) {
+transformers.add(new TransformerInfo(ReflectionUtils.loadClass(configuredTransformer)));
+  } else {
+String[] splits = configuredTransformer.split(TRANSFORMER_CLASS_NAME_ID_DELIMITER);
+if (splits.length > 2) {
+  throw new IllegalArgumentException("There should only be one colon in a configured transformer");
+}
+String id = splits[0];
+validateIdentifier(id, identifiers, configuredTransformer);
+

[GitHub] [hudi] nsivabalan commented on a diff in pull request #8490: [HUDI-5968] Fix global index duplicate and handle custom payload when update partition

2023-05-02 Thread via GitHub


nsivabalan commented on code in PR #8490:
URL: https://github.com/apache/hudi/pull/8490#discussion_r1183159599


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
 // 2) is less than the first commit ts in the timeline
 return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(
+  HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations, HoodieWriteConfig config, HoodieTable hoodieTable) {
+final Option<String> instantTime = hoodieTable
+.getMetaClient()
+.getCommitsTimeline()
+.filterCompletedInstants()
+.lastInstant()
+.map(HoodieInstant::getTimestamp);
+return partitionLocations.flatMap(p -> {
+  String partitionPath = p.getLeft();
+  String fileId = p.getRight().getFileId();
+  return new HoodieMergedReadHandle(config, instantTime, hoodieTable, Pair.of(partitionPath, fileId))
+  .getMergedRecords().iterator();
+});
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
+  HoodieData<Pair<HoodieRecord<R>, Option<Pair<String, HoodieRecordLocation>>>> taggedHoodieRecords, HoodieWriteConfig config, HoodieTable hoodieTable) {
+// completely new records
+HoodieData<HoodieRecord<R>> newRecords = taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
+// the records tagged to existing base files
+HoodieData<HoodieRecord<R>> updatingRecords = taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft)
+.distinctWithKey(HoodieRecord::getRecordKey, config.getGlobalIndexReconcileParallelism());

Review Comment:
   I see we are doing distinctWithKey here. So, do we assume that records are never duplicated at all?
   What happens if there are duplicates already, e.g. someone ingested the same batch of data with bulk_insert? Maybe we need to revisit the overall end-to-end flow for how our global index will work in this scenario.
   But I am trying to think through how it might surface after this fix.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##
@@ -175,4 +186,72 @@ public static boolean checkIfValidCommit(HoodieTimeline commitTimeline, String c
 // 2) is less than the first commit ts in the timeline
 return !commitTimeline.empty() && commitTimeline.containsOrBeforeTimelineStarts(commitTs);
   }
+
+  public static <R> HoodieData<HoodieRecord<R>> getTaggedRecordsFromPartitionLocations(
+  HoodieData<Pair<String, HoodieRecordLocation>> partitionLocations, HoodieWriteConfig config, HoodieTable hoodieTable) {
+final Option<String> instantTime = hoodieTable
+.getMetaClient()
+.getCommitsTimeline()
+.filterCompletedInstants()
+.lastInstant()
+.map(HoodieInstant::getTimestamp);
+return partitionLocations.flatMap(p -> {
+  String partitionPath = p.getLeft();
+  String fileId = p.getRight().getFileId();
+  return new HoodieMergedReadHandle(config, instantTime, hoodieTable, Pair.of(partitionPath, fileId))
+  .getMergedRecords().iterator();
+});
+  }
+
+  public static <R> HoodieData<HoodieRecord<R>> mergeForPartitionUpdates(
+  HoodieData<Pair<HoodieRecord<R>, Option<Pair<String, HoodieRecordLocation>>>> taggedHoodieRecords, HoodieWriteConfig config, HoodieTable hoodieTable) {
+// completely new records
+HoodieData<HoodieRecord<R>> newRecords = taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
+// the records tagged to existing base files
+HoodieData<HoodieRecord<R>> updatingRecords = taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft)

Review Comment:
   I see we process `taggedHoodieRecords.filter(p -> p.getRight().isPresent())` multiple times. Should we take it out and cache it?
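   For illustration, a minimal sketch of what caching could look like, assuming `HoodieData` exposes a `persist` hook as the Spark-backed implementation does; the storage level string is an example:

   ```java
   // Illustrative only: materialize the tagged records once so the two
   // filter passes below do not recompute the upstream index lookup.
   taggedHoodieRecords.persist("MEMORY_AND_DISK_SER");

   HoodieData<HoodieRecord<R>> newRecords =
       taggedHoodieRecords.filter(p -> !p.getRight().isPresent()).map(Pair::getLeft);
   HoodieData<HoodieRecord<R>> updatingRecords =
       taggedHoodieRecords.filter(p -> p.getRight().isPresent()).map(Pair::getLeft);
   ```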



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergedReadHandle.java:
##
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLoca

[GitHub] [hudi] alberttwong commented on issue #8623: [SUPPORT] Following Docker Demo Quickstart with OpenJDK 1.8 on Mac arm64

2023-05-02 Thread via GitHub


alberttwong commented on issue #8623:
URL: https://github.com/apache/hudi/issues/8623#issuecomment-1532279447

   ```
   atwong@Alberts-MBP hudi % uname -a
   Darwin Alberts-MBP.localdomain 22.4.0 Darwin Kernel Version 22.4.0: Mon Mar  
6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020 arm64
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alberttwong commented on issue #8623: [SUPPORT] Following Docker Demo Quickstart with OpenJDK 1.8 on Mac arm64

2023-05-02 Thread via GitHub


alberttwong commented on issue #8623:
URL: https://github.com/apache/hudi/issues/8623#issuecomment-1532279267

   this didn't work either
   
   ```
   atwong@Alberts-MBP hudi % java -version
   openjdk version "1.8.0_372"
   OpenJDK Runtime Environment (Zulu 8.70.0.23-CA-macos-aarch64) (build 
1.8.0_372-b07)
   OpenJDK 64-Bit Server VM (Zulu 8.70.0.23-CA-macos-aarch64) (build 
25.372-b07, mixed mode)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alberttwong commented on issue #8623: [SUPPORT] Following Docker Demo Quickstart with OpenJDK 1.8 on Mac arm64

2023-05-02 Thread via GitHub


alberttwong commented on issue #8623:
URL: https://github.com/apache/hudi/issues/8623#issuecomment-1532276315

   also tried 
   ```
   atwong@Alberts-MBP hudi % java -version
   openjdk version "1.8.0_372"
   OpenJDK Runtime Environment (Temurin)(build 1.8.0_372-b07)
   OpenJDK 64-Bit Server VM (Temurin)(build 25.372-b07, mixed mode)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alberttwong opened a new issue, #8623: [SUPPORT] Following Docker Demo Quickstart with OpenJDK 1.8 on Mac arm64

2023-05-02 Thread via GitHub


alberttwong opened a new issue, #8623:
URL: https://github.com/apache/hudi/issues/8623

   I'm getting this issue using OpenJDK 1.8
   
   ```
   [INFO] 

   [INFO] BUILD FAILURE
   [INFO] 

   [INFO] Total time:  6.033 s
   [INFO] Finished at: 2023-05-02T16:06:42-07:00
   [INFO] 

   [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) 
on project hudi-common: Compilation failure: Compilation failure:
   [ERROR] 
/Users/atwong/sandbox/hudi/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:[206,9]
 no suitable method found for 
collect(java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>)
   [ERROR] method 
java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer)
 is not applicable
   [ERROR]   (cannot infer type-variable(s) R
   [ERROR] (actual and formal argument lists differ in length))
   [ERROR] method 
java.util.stream.Stream.collect(java.util.stream.Collector) is not applicable
   [ERROR]   (cannot infer type-variable(s) R,A
   [ERROR] (argument mismatch; 
java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>
 cannot be converted to java.util.stream.Collector))
   [ERROR] 
/Users/atwong/sandbox/hudi/hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:[339,11]
 no suitable method found for 
collect(java.util.stream.Collector,capture#2
 of 
?,java.util.Map>>>)
   [ERROR] method 
java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer)
 is not applicable
   [ERROR]   (cannot infer type-variable(s) R
   [ERROR] (actual and formal argument lists differ in length))
   [ERROR] method 
java.util.stream.Stream.collect(java.util.stream.Collector) is not applicable
   [ERROR]   (cannot infer type-variable(s) R,A
   [ERROR] (argument mismatch; 
java.util.stream.Collector,capture#2
 of 
?,java.util.Map>>>
 cannot be converted to java.util.stream.Collector))
   [ERROR] -> [Help 1]
   [ERROR]
   [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR]
   [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
   [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
   [ERROR]
   [ERROR] After correcting the problems, you can resume the build with the 
command
   [ERROR]   mvn  -rf :hudi-common
   atwong@Alberts-MBP hudi % java -version
   openjdk version "1.8.0_292"
   OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
   OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
   atwong@Alberts-MBP hudi % jenv version
   openjdk64-1.8.0.292 (set by /Users/atwong/.jenv/version)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alberttwong commented on issue #5552: [SUPPORT] It failed to compile raw hudi src with error "oodieTableMetadataUtil.java:[189,7] no suitable method found for collect(java.util.stream.

2023-05-02 Thread via GitHub


alberttwong commented on issue #5552:
URL: https://github.com/apache/hudi/issues/5552#issuecomment-1532262775

   ```
   [INFO] 

   [INFO] BUILD FAILURE
   [INFO] 

   [INFO] Total time:  6.033 s
   [INFO] Finished at: 2023-05-02T16:06:42-07:00
   [INFO] 

   [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) 
on project hudi-common: Compilation failure: Compilation failure:
   [ERROR] 
/Users/atwong/sandbox/hudi/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:[206,9]
 no suitable method found for 
collect(java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>)
   [ERROR] method 
java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer)
 is not applicable
   [ERROR]   (cannot infer type-variable(s) R
   [ERROR] (actual and formal argument lists differ in length))
   [ERROR] method 
java.util.stream.Stream.collect(java.util.stream.Collector) is not applicable
   [ERROR]   (cannot infer type-variable(s) R,A
   [ERROR] (argument mismatch; 
java.util.stream.Collector,capture#1
 of 
?,java.util.Map>>
 cannot be converted to java.util.stream.Collector))
   [ERROR] 
/Users/atwong/sandbox/hudi/hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:[339,11]
 no suitable method found for 
collect(java.util.stream.Collector,capture#2
 of 
?,java.util.Map>>>)
   [ERROR] method 
java.util.stream.Stream.collect(java.util.function.Supplier,java.util.function.BiConsumer,java.util.function.BiConsumer)
 is not applicable
   [ERROR]   (cannot infer type-variable(s) R
   [ERROR] (actual and formal argument lists differ in length))
   [ERROR] method 
java.util.stream.Stream.collect(java.util.stream.Collector) is not applicable
   [ERROR]   (cannot infer type-variable(s) R,A
   [ERROR] (argument mismatch; 
java.util.stream.Collector,capture#2
 of 
?,java.util.Map>>>
 cannot be converted to java.util.stream.Collector))
   [ERROR] -> [Help 1]
   [ERROR]
   [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR]
   [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
   [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
   [ERROR]
   [ERROR] After correcting the problems, you can resume the build with the 
command
   [ERROR]   mvn  -rf :hudi-common
   atwong@Alberts-MBP hudi % java -version
   openjdk version "1.8.0_292"
   OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
   OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)
   atwong@Alberts-MBP hudi % jenv version
   openjdk64-1.8.0.292 (set by /Users/atwong/.jenv/version)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8303:
URL: https://github.com/apache/hudi/pull/8303#issuecomment-1532254235

   
   ## CI report:
   
   * 3cfef7fc92a6c5ce9bb078a7186e04614c11647f UNKNOWN
   * 3ad5ae580928952bb601cf90f09abb53d1d436e4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16798)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8622: [HUDI-6163] Add PR size labeler

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8622:
URL: https://github.com/apache/hudi/pull/8622#issuecomment-1532250009

   
   ## CI report:
   
   * fed3fee57fb882abd972995f24182b8598b6c576 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16797)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #8606: [MINOR] Check the return value from delete during rollback and finalize to ensure the files actually got deleted.

2023-05-02 Thread via GitHub


nbalajee commented on code in PR #8606:
URL: https://github.com/apache/hudi/pull/8606#discussion_r1183090630


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackHelper.java:
##
@@ -197,14 +197,21 @@ protected List deleteFiles(HoodieTableMetaClient metaClient,
 // if first rollback attempt failed and retried again, chances that some files are already deleted.
 isDeleted = true;
   }
+
+  if (!isDeleted) {

Review Comment:
   
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html#delete(org.apache.hadoop.fs.Path,%20boolean)
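   For reference, a minimal sketch of a checked delete per that contract, where `delete` returns false when nothing was removed; the helper name is illustrative:

   ```java
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   import java.io.IOException;

   // Illustrative only: surface deletes that silently failed instead of
   // assuming every delete() call succeeded.
   static void deleteChecked(FileSystem fs, Path path) throws IOException {
     boolean deleted = fs.delete(path, false); // non-recursive delete
     if (!deleted && fs.exists(path)) {
       // false with the file still present means rollback/finalize did not
       // actually remove it; fail loudly rather than leaving stray files.
       throw new IOException("Failed to delete " + path);
     }
   }
   ```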



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nbalajee commented on a diff in pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.

2023-05-02 Thread via GitHub


nbalajee commented on code in PR #8604:
URL: https://github.com/apache/hudi/pull/8604#discussion_r1183087756


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##
@@ -161,27 +161,28 @@ protected void commit(String instantTime, Map
 Option<HoodieInstant> alreadyCompletedInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry -> entry.getTimestamp().equals(instantTime)).lastInstant();
-if (alreadyCompletedInstant.isPresent()) {
-  // this code path refers to a re-attempted commit that got committed to metadata table, but failed in datatable.
-  // for eg, lets say compaction c1 on 1st attempt succeeded in metadata table and failed before committing to datatable.
-  // when retried again, data table will first rollback pending compaction. these will be applied to metadata table, but all changes
-  // are upserts to metadata table and so only a new delta commit will be created.
-  // once rollback is complete, compaction will be retried again, which will eventually hit this code block where the respective commit is
-  // already part of completed commit. So, we have to manually remove the completed instant and proceed.
-  // and it is for the same reason we enabled withAllowMultiWriteOnSameInstant for metadata table.
-  HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
-  metadataMetaClient.reloadActiveTimeline();
+LOG.info(String.format("%s completed commit at %s being applied to metadata table",
+alreadyCompletedInstant.isPresent() ? "Already" : "Partially", instantTime));
+
+// Rollback the previous committed commit
+if (!writeClient.rollback(instantTime)) {

Review Comment:
   The older solution of removing the completed action and reattempting won't work in all scenarios. We have to consider the following:
   (1) c1.commit failed on the main dataset; on the MDT, c1.deltacommit was completed.
 (a) With the record index enabled, a new log block was added to the log file by c1.deltacommit. Simply removing the deltacommit may not be enough and will require an additional action to roll back the log block, to keep the log file consistent.
   (2) c1.clean was attempted and c1.deltacommit was completed. When the clean is retried, the second attempt could bring in some of the files that were in the "failed" list of the first attempt (vs the "success" list).
   (3) c1.rollback was attempted and c1.deltacommit was completed. (We fixed an issue with an incomplete rollback where the MDT was updated with a deltacommit; this change played a role in that scenario as well.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6153) Change the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks

2023-05-02 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718734#comment-17718734
 ] 

sivabalan narayanan commented on HUDI-6153:
---

1 file group, 5 versions.

1 clean, which cleaned up the first version.

So we have 4 slices:

C1 (cleaned up), C2, ..., C5, plus c6.clean.

MDT will have 5 files.
After C6 is applied to MDT, only 4 files remain (C2 ... C5).

Restore to C4:
MDT will roll back to C4.

We don't re-apply or negate C6 since rollback is applicable only to the write timeline,
so we need to re-apply the cleaned commits.
 

> Change the rollback mechanism for MDT to actual rollbacks rather than 
> appending revert blocks
> -
>
> Key: HUDI-6153
> URL: https://issues.apache.org/jira/browse/HUDI-6153
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.14.0, 1.0.0
>
>
> When rolling back completed commits for indexes like record-index, the list 
> of all keys removed from the dataset is required. This information cannot be 
> available during rollback processing in MDT since the files have already been 
> deleted during the rollback inflight processing. 
> Hence, the current MDT rollback mechanism of adding -files, -col_stats 
> entries does not work for record index.
> This PR changes the rollback mechanism to actually rollback deltacommits on 
> the MDT. This makes the rollback handing faster and keeps the MDT in sync 
> with dataset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8303:
URL: https://github.com/apache/hudi/pull/8303#issuecomment-1532151375

   
   ## CI report:
   
   * 3cfef7fc92a6c5ce9bb078a7186e04614c11647f UNKNOWN
   * e7795634f222d6d27363dc4900c9fb458105ffce Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16743)
 
   * 3ad5ae580928952bb601cf90f09abb53d1d436e4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16798)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8303: [HUDI-5998] Speed up reads from bootstrapped tables in spark

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8303:
URL: https://github.com/apache/hudi/pull/8303#issuecomment-1532141550

   
   ## CI report:
   
   * 3cfef7fc92a6c5ce9bb078a7186e04614c11647f UNKNOWN
   * e7795634f222d6d27363dc4900c9fb458105ffce Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16743)
 
   * 3ad5ae580928952bb601cf90f09abb53d1d436e4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6104) Clean deleted partition with clean policy

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6104:

Status: Patch Available  (was: In Progress)

> Clean deleted partition with clean policy
> -
>
> Key: HUDI-6104
> URL: https://issues.apache.org/jira/browse/HUDI-6104
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: HBG
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6104) Clean deleted partition with clean policy

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6104:

Status: In Progress  (was: Open)

> Clean deleted partition with clean policy
> -
>
> Key: HUDI-6104
> URL: https://issues.apache.org/jira/browse/HUDI-6104
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: HBG
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6105) Partial update for MERGE INTO

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6105:

Status: Patch Available  (was: In Progress)

> Partial update for MERGE INTO
> -
>
> Key: HUDI-6105
> URL: https://issues.apache.org/jira/browse/HUDI-6105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6105) Partial update for MERGE INTO

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6105:

Fix Version/s: 1.0.0

> Partial update for MERGE INTO
> -
>
> Key: HUDI-6105
> URL: https://issues.apache.org/jira/browse/HUDI-6105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6104) Clean deleted partition with clean policy

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6104:

Fix Version/s: 0.14.0

> Clean deleted partition with clean policy
> -
>
> Key: HUDI-6104
> URL: https://issues.apache.org/jira/browse/HUDI-6104
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: HBG
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6106) Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6106:

Status: In Progress  (was: Open)

> Spark offline compaction/Clustering Job will do clean like Flink job
> 
>
> Key: HUDI-6106
> URL: https://issues.apache.org/jira/browse/HUDI-6106
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, compaction, spark
>Reporter: zhuanshenbsj1
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6105) Partial update for MERGE INTO

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6105:

Status: In Progress  (was: Open)

> Partial update for MERGE INTO
> -
>
> Key: HUDI-6105
> URL: https://issues.apache.org/jira/browse/HUDI-6105
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Danny Chen
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6106) Spark offline compaction/Clustering Job will do clean like Flink job

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6106:

Status: Patch Available  (was: In Progress)

> Spark offline compaction/Clustering Job will do clean like Flink job
> 
>
> Key: HUDI-6106
> URL: https://issues.apache.org/jira/browse/HUDI-6106
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering, compaction, spark
>Reporter: zhuanshenbsj1
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6107) Fix java.lang.IllegalArgumentException for bootstrap

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6107:

Fix Version/s: 0.14.0

>  Fix java.lang.IllegalArgumentException  for bootstrap
> --
>
> Key: HUDI-6107
> URL: https://issues.apache.org/jira/browse/HUDI-6107
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: weiming
>Assignee: weiming
>Priority: Critical
> Fix For: 0.14.0
>
>
> The Hive table of ORC or Parquet type will report the same error
>  
> my command:
> spark-submit \
> --queue root.default_queue \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  
> hudijar/hudi-utilities-bundle_2.12-0.13.0.jar \
> --run-bootstrap \
> --target-base-path 
> /user/prod_datalake_test/datalake_test/hive/datalake_test/wm_test_bootstrap_hudi01
>  \
> --target-table wm_test_bootstrap_hudi01 \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --hoodie-conf 
> hoodie.bootstrap.base.path=/user/prod_datalake_test/datalake_test_dev/hive/datalake_test_dev/wm_test_bootstrap_hudi01
>  \
> --hoodie-conf hoodie.datasource.write.recordkey.field=account \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
> --hoodie-conf hoodie.datasource.write.precombine.field=account \
> --hoodie-conf 
> hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
> --hoodie-conf 
> hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkOrcBootstrapDataProvider
>  \
> --hoodie-conf 
> hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
>  \
> --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --enable-sync  \
> --hoodie-conf hoodie.datasource.hive_sync.mode=HMS \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake_test \
> --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \
> --hoodie-conf hoodie.datasource.hive_sync.create_managed_table=true \
> --hoodie-conf hoodie.datasource.hive_sync.table=wm_test_bootstrap_hudi01 \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=dt \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
>  
> error log:
> 23/04/20 14:12:43 ERROR ApplicationMaster: User class threw exception: 
> java.lang.IllegalArgumentException java.lang.IllegalArgumentException at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.listAndProcessSourcePartitions(SparkBootstrapCommitActionExecutor.java:337)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.execute(SparkBootstrapCommitActionExecutor.java:134)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.bootstrap(HoodieSparkCopyOnWriteTable.java:187)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.bootstrap(SparkRDDWriteClient.java:131)
>  at 
> org.apache.hudi.utilities.deltastreamer.BootstrapExecutor.execute(BootstrapExecutor.java:167)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:189)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:573)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:748)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6107) Fix java.lang.IllegalArgumentException for bootstrap

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6107:

Fix Version/s: 1.0.0

>  Fix java.lang.IllegalArgumentException  for bootstrap
> --
>
> Key: HUDI-6107
> URL: https://issues.apache.org/jira/browse/HUDI-6107
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: weiming
>Assignee: weiming
>Priority: Blocker
> Fix For: 0.14.0, 1.0.0
>
>
> The Hive table of ORC or Parquet type will report the same error
>  
> my command:
> spark-submit \
> --queue root.default_queue \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  
> hudijar/hudi-utilities-bundle_2.12-0.13.0.jar \
> --run-bootstrap \
> --target-base-path 
> /user/prod_datalake_test/datalake_test/hive/datalake_test/wm_test_bootstrap_hudi01
>  \
> --target-table wm_test_bootstrap_hudi01 \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --hoodie-conf 
> hoodie.bootstrap.base.path=/user/prod_datalake_test/datalake_test_dev/hive/datalake_test_dev/wm_test_bootstrap_hudi01
>  \
> --hoodie-conf hoodie.datasource.write.recordkey.field=account \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
> --hoodie-conf hoodie.datasource.write.precombine.field=account \
> --hoodie-conf 
> hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
> --hoodie-conf 
> hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkOrcBootstrapDataProvider
>  \
> --hoodie-conf 
> hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
>  \
> --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --enable-sync  \
> --hoodie-conf hoodie.datasource.hive_sync.mode=HMS \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake_test \
> --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \
> --hoodie-conf hoodie.datasource.hive_sync.create_managed_table=true \
> --hoodie-conf hoodie.datasource.hive_sync.table=wm_test_bootstrap_hudi01 \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=dt \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
>  
> error log:
> 23/04/20 14:12:43 ERROR ApplicationMaster: User class threw exception: 
> java.lang.IllegalArgumentException java.lang.IllegalArgumentException at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.listAndProcessSourcePartitions(SparkBootstrapCommitActionExecutor.java:337)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.execute(SparkBootstrapCommitActionExecutor.java:134)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.bootstrap(HoodieSparkCopyOnWriteTable.java:187)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.bootstrap(SparkRDDWriteClient.java:131)
>  at 
> org.apache.hudi.utilities.deltastreamer.BootstrapExecutor.execute(BootstrapExecutor.java:167)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:189)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:573)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:748)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6107) Fix java.lang.IllegalArgumentException for bootstrap

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6107:

Affects Version/s: 0.13.0

>  Fix java.lang.IllegalArgumentException  for bootstrap
> --
>
> Key: HUDI-6107
> URL: https://issues.apache.org/jira/browse/HUDI-6107
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: weiming
>Assignee: weiming
>Priority: Blocker
> Fix For: 0.14.0
>
>
> The Hive table of ORC or Parquet type will report the same error
>  
> my command:
> spark-submit \
> --queue root.default_queue \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  
> hudijar/hudi-utilities-bundle_2.12-0.13.0.jar \
> --run-bootstrap \
> --target-base-path 
> /user/prod_datalake_test/datalake_test/hive/datalake_test/wm_test_bootstrap_hudi01
>  \
> --target-table wm_test_bootstrap_hudi01 \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --hoodie-conf 
> hoodie.bootstrap.base.path=/user/prod_datalake_test/datalake_test_dev/hive/datalake_test_dev/wm_test_bootstrap_hudi01
>  \
> --hoodie-conf hoodie.datasource.write.recordkey.field=account \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
> --hoodie-conf hoodie.datasource.write.precombine.field=account \
> --hoodie-conf 
> hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
> --hoodie-conf 
> hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkOrcBootstrapDataProvider
>  \
> --hoodie-conf 
> hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
>  \
> --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --enable-sync  \
> --hoodie-conf hoodie.datasource.hive_sync.mode=HMS \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake_test \
> --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \
> --hoodie-conf hoodie.datasource.hive_sync.create_managed_table=true \
> --hoodie-conf hoodie.datasource.hive_sync.table=wm_test_bootstrap_hudi01 \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=dt \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
>  
> error log:
> 23/04/20 14:12:43 ERROR ApplicationMaster: User class threw exception: 
> java.lang.IllegalArgumentException java.lang.IllegalArgumentException at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.listAndProcessSourcePartitions(SparkBootstrapCommitActionExecutor.java:337)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.execute(SparkBootstrapCommitActionExecutor.java:134)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.bootstrap(HoodieSparkCopyOnWriteTable.java:187)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.bootstrap(SparkRDDWriteClient.java:131)
>  at 
> org.apache.hudi.utilities.deltastreamer.BootstrapExecutor.execute(BootstrapExecutor.java:167)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:189)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:573)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:748)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6107) Fix java.lang.IllegalArgumentException for bootstrap

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6107:

Priority: Blocker  (was: Critical)

>  Fix java.lang.IllegalArgumentException  for bootstrap
> --
>
> Key: HUDI-6107
> URL: https://issues.apache.org/jira/browse/HUDI-6107
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: weiming
>Assignee: weiming
>Priority: Blocker
> Fix For: 0.14.0
>
>
> The Hive table of ORC or Parquet type will report the same error
>  
> my command:
> spark-submit \
> --queue root.default_queue \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer  
> hudijar/hudi-utilities-bundle_2.12-0.13.0.jar \
> --run-bootstrap \
> --target-base-path 
> /user/prod_datalake_test/datalake_test/hive/datalake_test/wm_test_bootstrap_hudi01
>  \
> --target-table wm_test_bootstrap_hudi01 \
> --table-type COPY_ON_WRITE \
> --base-file-format PARQUET \
> --hoodie-conf 
> hoodie.bootstrap.base.path=/user/prod_datalake_test/datalake_test_dev/hive/datalake_test_dev/wm_test_bootstrap_hudi01
>  \
> --hoodie-conf hoodie.datasource.write.recordkey.field=account \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
> --hoodie-conf hoodie.datasource.write.precombine.field=account \
> --hoodie-conf 
> hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
> --hoodie-conf 
> hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkOrcBootstrapDataProvider
>  \
> --hoodie-conf 
> hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
>  \
> --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --enable-sync  \
> --hoodie-conf hoodie.datasource.hive_sync.mode=HMS \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake_test \
> --hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \
> --hoodie-conf hoodie.datasource.hive_sync.create_managed_table=true \
> --hoodie-conf hoodie.datasource.hive_sync.table=wm_test_bootstrap_hudi01 \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=dt \
> --hoodie-conf 
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
>  
> error log:
> 23/04/20 14:12:43 ERROR ApplicationMaster: User class threw exception: 
> java.lang.IllegalArgumentException java.lang.IllegalArgumentException at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.listAndProcessSourcePartitions(SparkBootstrapCommitActionExecutor.java:337)
>  at 
> org.apache.hudi.table.action.bootstrap.SparkBootstrapCommitActionExecutor.execute(SparkBootstrapCommitActionExecutor.java:134)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.bootstrap(HoodieSparkCopyOnWriteTable.java:187)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.bootstrap(SparkRDDWriteClient.java:131)
>  at 
> org.apache.hudi.utilities.deltastreamer.BootstrapExecutor.execute(BootstrapExecutor.java:167)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:189)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:573)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:748)
>  
>  
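
For context on why the log above carries no detail: judging from the stack trace, the failing check is the single-argument checkArgument overload, which plausibly looks like the sketch below (an approximation reconstructed from the stack trace, not a verbatim copy of the Hudi source):

{code:java}
public class ValidationUtils {
  public static void checkArgument(final boolean expression) {
    if (!expression) {
      // No message is attached, so the bootstrap failure in
      // SparkBootstrapCommitActionExecutor.listAndProcessSourcePartitions
      // surfaces only as a bare "java.lang.IllegalArgumentException"
      // in the driver log.
      throw new IllegalArgumentException();
    }
  }
}
{code}

Threading a message through the check (for example, the offending source partition path) would make failures like this one diagnosable from the log alone.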



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6110) Hudi DOAP file error

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6110:

Fix Version/s: 0.14.0

> Hudi DOAP file error
> 
>
> Key: HUDI-6110
> URL: https://issues.apache.org/jira/browse/HUDI-6110
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: docs
>Reporter: Claude Warren
>Priority: Minor
>  Labels: documentation
> Fix For: 0.14.0
>
>
> The DOAP file [1] as listed in [2] has the error:
> [line: 44, col: 16] {E201} Multiple children of property element
> [1] 
> https://gitbox.apache.org/repos/asf?p=hudi.git;a=blob_plain;f=doap_HUDI.rdf;hb=HEAD
> [2] 
> https://svn.apache.org/repos/asf/comdev/projects.apache.org/trunk/data/projects.xml



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6111) Build hudi submodule cause checkstyle not found error

2023-05-02 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718714#comment-17718714
 ] 

Ethan Guo commented on HUDI-6111:
-

Hi [~lemonjing] could you provide your mvn build command of building hudi-cli 
module?

> Build hudi submodule cause checkstyle not found error
> -
>
> Key: HUDI-6111
> URL: https://issues.apache.org/jira/browse/HUDI-6111
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compile
>Reporter: Ran Tao
>Priority: Major
>
> When I build the Hudi source from the root dir, it's OK,
> but when I build a certain module such as hudi-cli, it fails with:
>  
> ```
> [INFO] --- maven-checkstyle-plugin:3.0.0:check (default) @ hudi-cli ---
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time:  01:46 min
> [INFO] Finished at: 2023-04-20T15:19:25+08:00
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:3.0.0:check (default) on 
> project hudi-cli: Failed during checkstyle configuration: cannot initialize 
> module TreeWalker - cannot initialize module ImportControl - illegal value 
> '/Users/xxx/GitHub/hudi/hudi-cli/style/import-control.xml' for property 
> 'file': com.puppycrawl.tools.checkstyle.api.CheckstyleException: Unable to 
> find: /Users/xxx/GitHub/hudi/hudi-cli/style/import-control.xml -> [Help 1]
> ```
> The problem seems clear: import-control.xml is not found, because 
> import-control.xml belongs to the root style dir.
> I have tried many submodules and they all have this problem. I wonder 
> whether the Hudi build is designed like this or whether it is an error on 
> my side.
> My env is:
> mac os m1 pro. 13.2.1 
> jdk 8
> mvn 3.6.3
>  
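
A common workaround, assuming the checkstyle paths only resolve correctly from the repository root: build the submodule via the root reactor with `mvn install -pl hudi-cli -am -DskipTests` instead of running Maven from inside hudi-cli, or bypass the check with the standard maven-checkstyle-plugin property `-Dcheckstyle.skip=true`.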



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6111) Build hudi submodule cause checkstyle not found error

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6111:

Fix Version/s: 0.14.0

> Build hudi submodule cause checkstyle not found error
> -
>
> Key: HUDI-6111
> URL: https://issues.apache.org/jira/browse/HUDI-6111
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compile
>Reporter: Ran Tao
>Priority: Major
> Fix For: 0.14.0
>
>
> When I build the Hudi source from the root dir, it's OK,
> but when I build a certain module such as hudi-cli, it fails with:
>  
> ```
> [INFO] --- maven-checkstyle-plugin:3.0.0:check (default) @ hudi-cli ---
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time:  01:46 min
> [INFO] Finished at: 2023-04-20T15:19:25+08:00
> [INFO] ------------------------------------------------------------------------
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:3.0.0:check (default) on 
> project hudi-cli: Failed during checkstyle configuration: cannot initialize 
> module TreeWalker - cannot initialize module ImportControl - illegal value 
> '/Users/xxx/GitHub/hudi/hudi-cli/style/import-control.xml' for property 
> 'file': com.puppycrawl.tools.checkstyle.api.CheckstyleException: Unable to 
> find: /Users/xxx/GitHub/hudi/hudi-cli/style/import-control.xml -> [Help 1]
> ```
> The problem seems clear: import-control.xml is not found, because 
> import-control.xml belongs to the root style dir.
> I have tried many submodules and they all have this problem. I wonder 
> whether the Hudi build is designed like this or whether it is an error on 
> my side.
> My env is:
> mac os m1 pro. 13.2.1 
> jdk 8
> mvn 3.6.3
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6113) Support multiple transformers using the same config keys in DeltaStreamer

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6113:

Fix Version/s: 0.14.0

> Support multiple transformers using the same config keys in DeltaStreamer
> -
>
> Key: HUDI-6113
> URL: https://issues.apache.org/jira/browse/HUDI-6113
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Affects Versions: 0.14.0
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently DeltaStreamer supports two or more transformers of the same type 
> (using the same configs). But these transformers use the same config keys 
> and could require those keys to be configured with different values.
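
One way to disambiguate, sketched below with hypothetical key names (illustrative only, not the API that was merged): scope each transformer's properties with a positional prefix and strip it before handing the properties to the instance.

{code:java}
import java.util.Properties;

public class PrefixedTransformerProps {
  // Hypothetical sketch: transformer.1.sql.file and transformer.2.sql.file
  // can then carry different values for what is otherwise the same key.
  public static Properties scopedProps(Properties all, int transformerIndex) {
    String prefix = "transformer." + transformerIndex + ".";
    Properties scoped = new Properties();
    for (String key : all.stringPropertyNames()) {
      if (key.startsWith(prefix)) {
        // Strip the prefix so the transformer sees its usual key names.
        scoped.setProperty(key.substring(prefix.length()), all.getProperty(key));
      }
    }
    return scoped;
  }
}
{code}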



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6113) Support multiple transformers using the same config keys in DeltaStreamer

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6113:

Priority: Blocker  (was: Major)

> Support multiple transformers using the same config keys in DeltaStreamer
> -
>
> Key: HUDI-6113
> URL: https://issues.apache.org/jira/browse/HUDI-6113
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Affects Versions: 0.14.0
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Currently DeltaStreamer supports two or more transformers of the same type 
> (using the same configs). But these transformers use the same config keys 
> and could require those keys to be configured with different values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6112:

Fix Version/s: 0.14.0

> Improve Doc generation to generate config tables for basic and advanced 
> configs
> 
>
> Key: HUDI-6112
> URL: https://issues.apache.org/jira/browse/HUDI-6112
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The HoodieConfigDocGenerator will need to be modified such that:
>  * Each config group has two sections: basic configs and advanced configs
>  * Basic configs and advanced configs are laid out in a table instead of 
> serially like today.
>  * Within each of these tables, the required configs are bubbled up to the 
> top of the table and highlighted.
> Add UI fixes to support a table layout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6113) Support multiple transformers using the same config keys in DeltaStreamer

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6113:

Status: Patch Available  (was: In Progress)

> Support multiple transformers using the same config keys in DeltaStreamer
> -
>
> Key: HUDI-6113
> URL: https://issues.apache.org/jira/browse/HUDI-6113
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Affects Versions: 0.14.0
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently DeltaStreamer supports two or more transformers of the same type 
> (using the same configs). But these transformers use the same config keys 
> and could require those keys to be configured with different values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6113) Support multiple transformers using the same config keys in DeltaStreamer

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6113:

Status: In Progress  (was: Open)

> Support multiple transformers using the same config keys in DeltaStreamer
> -
>
> Key: HUDI-6113
> URL: https://issues.apache.org/jira/browse/HUDI-6113
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Affects Versions: 0.14.0
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> Currently DeltaStreamer supports two or more transformers of the same type 
> (using the same configs). But these transformers use the same config keys 
> and could require those keys to be configured with different values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6115) Harden expected corrupt record column in chained transformer when error table settings are on/off

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6115:

Status: In Progress  (was: Open)

> Harden expected corrupt record column in chained transformer when error table 
> settings are on/off 
> --
>
> Key: HUDI-6115
> URL: https://issues.apache.org/jira/browse/HUDI-6115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
>
> When the error table is enabled and a transformer drops the existing 
> corruptRecordColumn, that can lead to quarantined records getting dropped. 
> This PR aims at hardening the expectation of corruptRecordColumn in the 
> output schemas of transformers.
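
A minimal sketch of what "hardening" can look like, assuming Spark's default corrupt-record column name `_corrupt_record` (the helper below is illustrative, not the merged change): re-attach the column if a transformer in the chain dropped it, so quarantine routing still finds it downstream.

{code:java}
import static org.apache.spark.sql.functions.lit;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

public class CorruptRecordColumnGuard {
  public static Dataset<Row> ensureColumn(Dataset<Row> transformed, String corruptRecordColumn) {
    if (Arrays.asList(transformed.columns()).contains(corruptRecordColumn)) {
      return transformed;
    }
    // A transformer dropped the column: restore it as null strings so
    // downstream error-table handling does not silently lose records.
    return transformed.withColumn(corruptRecordColumn, lit(null).cast(DataTypes.StringType));
  }
}
{code}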



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6115) Harden expected corrupt record column in chained transformer when error table settings are on/off

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6115:

Status: Patch Available  (was: In Progress)

> Harden expected corrupt record column in chained transformer when error table 
> settings are on/off 
> --
>
> Key: HUDI-6115
> URL: https://issues.apache.org/jira/browse/HUDI-6115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
>
> When the error table is enabled and a transformer drops the existing 
> corruptRecordColumn, that can lead to quarantined records getting dropped. 
> This PR aims at hardening the expectation of corruptRecordColumn in the 
> output schemas of transformers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6115) Harden expected corrupt record column in chained transformer when error table settings are on/off

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6115:

Fix Version/s: 0.14.0

> Harden expected corrupt record column in chained transformer when error table 
> settings are on/off 
> --
>
> Key: HUDI-6115
> URL: https://issues.apache.org/jira/browse/HUDI-6115
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Harshal Patil
>Assignee: Harshal Patil
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the error table is enabled and a transformer drops the existing 
> corruptRecordColumn, that can lead to quarantined records getting dropped. 
> This PR aims at hardening the expectation of corruptRecordColumn in the 
> output schemas of transformers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6116:

Status: Patch Available  (was: In Progress)

> Optimize log block reading by removing seeks to check corrupted blocks
> --
>
> Key: HUDI-6116
> URL: https://issues.apache.org/jira/browse/HUDI-6116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> The code currently does an eager isCorruptedCheck, for which we do a seek 
> and then a read; this invalidates our internal buffers in the opened file 
> stream to the log file and makes a call to the DataNode to start a new 
> blockReader.
> The seek + read becomes apparent when we do cross-datacenter reads or where 
> the latency to the file is high. In some cases, a single RPC will cost us 
> about 120ms + the cost of the RPC (west coast to east coast), so this seek 
> is bad for performance.
> Delaying the corrupt check also gives us many benefits in low-latency envs, 
> where we see times reducing from (5 to 8 sec) to (3 sec to < 500 ms) for 
> moderately sized files of 250MB.
> NOTE: The more log blocks there are to read, the greater the performance 
> improvement.
>  
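
To make the cost concrete, the sketch below contrasts the two approaches. The structure and method names are assumptions for illustration, not the actual HoodieLogFileReader code:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

class LogBlockScanSketch {
  // Eager check: seek ahead to the block's trailing length marker, read it,
  // then seek back to the block start. The two extra seeks invalidate the
  // stream's read buffers and force a new block reader on each check.
  static boolean eagerIsCorrupted(FSDataInputStream in, long blockStart, long blockLen)
      throws IOException {
    in.seek(blockStart + blockLen - Long.BYTES); // extra positioned read #1
    long trailingLen = in.readLong();
    in.seek(blockStart);                         // extra positioned read #2
    return trailingLen != blockLen;
  }

  // Deferred check: read the block sequentially and validate the trailing
  // length marker in-stream, with no backward seeks at all.
  static byte[] readAndValidate(FSDataInputStream in, long blockLen) throws IOException {
    byte[] content = new byte[(int) (blockLen - Long.BYTES)];
    in.readFully(content);
    long trailingLen = in.readLong();
    if (trailingLen != blockLen) {
      throw new IOException("Corrupted log block"); // caller treats as corrupt
    }
    return content;
  }
}
{code}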



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6116:

Fix Version/s: 0.14.0

> Optimize log block reading by removing seeks to check corrupted blocks
> --
>
> Key: HUDI-6116
> URL: https://issues.apache.org/jira/browse/HUDI-6116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> The code currently does an eager isCorruptedCheck, for which we do a seek 
> and then a read; this invalidates our internal buffers in the opened file 
> stream to the log file and makes a call to the DataNode to start a new 
> blockReader.
> The seek + read becomes apparent when we do cross-datacenter reads or where 
> the latency to the file is high. In some cases, a single RPC will cost us 
> about 120ms + the cost of the RPC (west coast to east coast), so this seek 
> is bad for performance.
> Delaying the corrupt check also gives us many benefits in low-latency envs, 
> where we see times reducing from (5 to 8 sec) to (3 sec to < 500 ms) for 
> moderately sized files of 250MB.
> NOTE: The more log blocks there are to read, the greater the performance 
> improvement.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6116:

Status: In Progress  (was: Open)

> Optimize log block reading by removing seeks to check corrupted blocks
> --
>
> Key: HUDI-6116
> URL: https://issues.apache.org/jira/browse/HUDI-6116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> The code currently does an eager isCorruptedCheck, for which we do a seek 
> and then a read; this invalidates our internal buffers in the opened file 
> stream to the log file and makes a call to the DataNode to start a new 
> blockReader.
> The seek + read becomes apparent when we do cross-datacenter reads or where 
> the latency to the file is high. In some cases, a single RPC will cost us 
> about 120ms + the cost of the RPC (west coast to east coast), so this seek 
> is bad for performance.
> Delaying the corrupt check also gives us many benefits in low-latency envs, 
> where we see times reducing from (5 to 8 sec) to (3 sec to < 500 ms) for 
> moderately sized files of 250MB.
> NOTE: The more log blocks there are to read, the greater the performance 
> improvement.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6116) Optimize log block reading by removing seeks to check corrupted blocks

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6116:

Fix Version/s: 1.0.0

> Optimize log block reading by removing seeks to check corrupted blocks
> --
>
> Key: HUDI-6116
> URL: https://issues.apache.org/jira/browse/HUDI-6116
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> The code currently does an eager isCorruptedCheck, for which we do a seek 
> and then a read; this invalidates our internal buffers in the opened file 
> stream to the log file and makes a call to the DataNode to start a new 
> blockReader.
> The seek + read becomes apparent when we do cross-datacenter reads or where 
> the latency to the file is high. In some cases, a single RPC will cost us 
> about 120ms + the cost of the RPC (west coast to east coast), so this seek 
> is bad for performance.
> Delaying the corrupt check also gives us many benefits in low-latency envs, 
> where we see times reducing from (5 to 8 sec) to (3 sec to < 500 ms) for 
> moderately sized files of 250MB.
> NOTE: The more log blocks there are to read, the greater the performance 
> improvement.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6118) Provide reasonable defaults for operation parallelism in MDT write configuration

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6118:

Fix Version/s: 0.14.0
   1.0.0

> Provide reasonable defaults for operation parallelism in MDT write 
> configuration
> 
>
> Key: HUDI-6118
> URL: https://issues.apache.org/jira/browse/HUDI-6118
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.14.0, 1.0.0
>
>
> The current defaults are not optimal for large partitions like record index. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6118) Provide reasonable defaults for operation parallelism in MDT write configuration

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6118:

Component/s: metadata

> Provide reasonable defaults for operation parallelism in MDT write 
> configuration
> 
>
> Key: HUDI-6118
> URL: https://issues.apache.org/jira/browse/HUDI-6118
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: metadata
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.14.0, 1.0.0
>
>
> The current defaults are not optimal for large partitions like record index. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6117) Parallelize creation of initial file groups for MDT partitions

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6117:

Fix Version/s: 0.14.0

> Parallelize creation of initial file groups for MDT partitions
> --
>
> Key: HUDI-6117
> URL: https://issues.apache.org/jira/browse/HUDI-6117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When there are a large number of file groups in an MDT partition (a record 
> index holding billions of records can have 10k+ file groups), creating the 
> initial log files in a for-loop can take a long time (100ms per create x 
> 10k creates = 1000 seconds = ~16mins), and routinely this latency is as 
> high as 500msec per create.
> The initial file group creation can be optimized to be done in parallel, 
> speeding up MDT partition initialization.
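
The arithmetic above is 10,000 file groups x ~100 ms per create, or about 1,000 s serially; fanning the creates out cuts that by roughly the worker count. A minimal sketch with a hypothetical per-file-group create call:

{code:java}
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FileGroupInitSketch {
  public static List<String> createInitialFileGroups(int fileGroupCount) {
    return IntStream.range(0, fileGroupCount)
        .parallel() // fan out the ~100-500 ms per-file-group creates
        .mapToObj(FileGroupInitSketch::createInitialLogFile)
        .collect(Collectors.toList());
  }

  // Placeholder for the real create call against storage (hypothetical).
  private static String createInitialLogFile(int fileGroupIndex) {
    return "file-group-" + fileGroupIndex;
  }
}
{code}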



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6117) Parallelize creation of initial file groups for MDT partitions

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6117:

Fix Version/s: 1.0.0

> Parallelize creation of initial file groups for MDT partitions
> --
>
> Key: HUDI-6117
> URL: https://issues.apache.org/jira/browse/HUDI-6117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> When there are a large number of file groups in an MDT partition (a record 
> index holding billions of records can have 10k+ file groups), creating the 
> initial log files in a for-loop can take a long time (100ms per create x 
> 10k creates = 1000 seconds = ~16mins), and routinely this latency is as 
> high as 500msec per create.
> The initial file group creation can be optimized to be done in parallel, 
> speeding up MDT partition initialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6117) Parallelize creation of initial file groups for MDT partitions

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6117:

Status: In Progress  (was: Open)

> Parallelize creation of initial file groups for MDT partitions
> --
>
> Key: HUDI-6117
> URL: https://issues.apache.org/jira/browse/HUDI-6117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> When there are a large number of file groups in an MDT partition (a record 
> index holding billions of records can have 10k+ file groups), creating the 
> initial log files in a for-loop can take a long time (100ms per create x 
> 10k creates = 1000 seconds = ~16mins), and routinely this latency is as 
> high as 500msec per create.
> The initial file group creation can be optimized to be done in parallel, 
> speeding up MDT partition initialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6117) Parallelize creation of initial file groups for MDT partitions

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6117:

Status: Patch Available  (was: In Progress)

> Parallelize creation of initial file groups for MDT partitions
> --
>
> Key: HUDI-6117
> URL: https://issues.apache.org/jira/browse/HUDI-6117
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> When there are a large number of file groups in an MDT partition (a record 
> index holding billions of records can have 10k+ file groups), creating the 
> initial log files in a for-loop can take a long time (100ms per create x 
> 10k creates = 1000 seconds = ~16mins), and routinely this latency is as 
> high as 500msec per create.
> The initial file group creation can be optimized to be done in parallel, 
> speeding up MDT partition initialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6120) fetchAllLogsMergedFileSlice will read basefile which it does not expect

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6120:

Fix Version/s: 1.0.0

> fetchAllLogsMergedFileSlice will read basefile which it does not expect
> ---
>
> Key: HUDI-6120
> URL: https://issues.apache.org/jira/browse/HUDI-6120
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jianhui Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> Check the code snippet of 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView#fetchAllLogsMergedFileSlice:
> {code:java}
> private Option<FileSlice> fetchAllLogsMergedFileSlice(HoodieFileGroup fileGroup, String maxInstantTime) {
>   List<FileSlice> fileSlices = fileGroup.getAllFileSlicesBeforeOn(maxInstantTime).collect(Collectors.toList());
>   if (fileSlices.size() == 0) {
>     return Option.empty();
>   }
>   if (fileSlices.size() == 1) {
>     return Option.of(fileSlices.get(0));
>   }
>   final FileSlice latestSlice = fileSlices.get(0);
>   FileSlice merged = new FileSlice(latestSlice.getPartitionPath(), latestSlice.getBaseInstantTime(),
>       latestSlice.getFileId());
>   // add log files from the latest slice to the earliest
>   fileSlices.forEach(slice -> slice.getLogFiles().forEach(merged::addLogFile));
>   return Option.of(merged);
> }{code}
> If we fetch only one file slice, we return that slice with its base file, 
> and then hudi-flink creates a SkipMergeIterator/MergeIterator which reads 
> both the base file and the log files for the split.
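
One possible shape of a fix, reusing only the API already visible in the snippet above (a sketch, not the merged patch): in the single-slice case, copy the slice's log files into a new FileSlice without the base file, so the logs-merged view never hands a base file to the iterators.

{code:java}
private Option<FileSlice> logsOnlyFileSlice(FileSlice slice) {
  FileSlice logsOnly = new FileSlice(slice.getPartitionPath(), slice.getBaseInstantTime(),
      slice.getFileId());
  // Copy the log files but deliberately omit the base file.
  slice.getLogFiles().forEach(logsOnly::addLogFile);
  return Option.of(logsOnly);
}
{code}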



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6120) fetchAllLogsMergedFileSlice will read basefile which it does not expect

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6120:

Status: In Progress  (was: Open)

> fetchAllLogsMergedFileSlice will read basefile which it does not expect
> ---
>
> Key: HUDI-6120
> URL: https://issues.apache.org/jira/browse/HUDI-6120
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jianhui Dong
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> Check the code snippet of 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView#fetchAllLogsMergedFileSlice:
> {code:java}
> private Option<FileSlice> fetchAllLogsMergedFileSlice(HoodieFileGroup fileGroup, String maxInstantTime) {
>   List<FileSlice> fileSlices = fileGroup.getAllFileSlicesBeforeOn(maxInstantTime).collect(Collectors.toList());
>   if (fileSlices.size() == 0) {
>     return Option.empty();
>   }
>   if (fileSlices.size() == 1) {
>     return Option.of(fileSlices.get(0));
>   }
>   final FileSlice latestSlice = fileSlices.get(0);
>   FileSlice merged = new FileSlice(latestSlice.getPartitionPath(), latestSlice.getBaseInstantTime(),
>       latestSlice.getFileId());
>   // add log files from the latest slice to the earliest
>   fileSlices.forEach(slice -> slice.getLogFiles().forEach(merged::addLogFile));
>   return Option.of(merged);
> }{code}
> If we fetch only one file slice, we return that slice with its base file, 
> and then hudi-flink creates a SkipMergeIterator/MergeIterator which reads 
> both the base file and the log files for the split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6120) fetchAllLogsMergedFileSlice will read basefile which it does not expect

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6120:

Priority: Blocker  (was: Major)

> fetchAllLogsMergedFileSlice will read basefile which it does not expect
> ---
>
> Key: HUDI-6120
> URL: https://issues.apache.org/jira/browse/HUDI-6120
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jianhui Dong
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> Check the code snippet of 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView#fetchAllLogsMergedFileSlice:
> {code:java}
> private Option<FileSlice> fetchAllLogsMergedFileSlice(HoodieFileGroup fileGroup, String maxInstantTime) {
>   List<FileSlice> fileSlices = fileGroup.getAllFileSlicesBeforeOn(maxInstantTime).collect(Collectors.toList());
>   if (fileSlices.size() == 0) {
>     return Option.empty();
>   }
>   if (fileSlices.size() == 1) {
>     return Option.of(fileSlices.get(0));
>   }
>   final FileSlice latestSlice = fileSlices.get(0);
>   FileSlice merged = new FileSlice(latestSlice.getPartitionPath(), latestSlice.getBaseInstantTime(),
>       latestSlice.getFileId());
>   // add log files from the latest slice to the earliest
>   fileSlices.forEach(slice -> slice.getLogFiles().forEach(merged::addLogFile));
>   return Option.of(merged);
> }{code}
> If we fetch only one file slice, we return that slice with its base file, 
> and then hudi-flink creates a SkipMergeIterator/MergeIterator which reads 
> both the base file and the log files for the split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6120) fetchAllLogsMergedFileSlice will read basefile which it does not expect

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6120:

Status: Patch Available  (was: In Progress)

> fetchAllLogsMergedFileSlice will read basefile which it does not expect
> ---
>
> Key: HUDI-6120
> URL: https://issues.apache.org/jira/browse/HUDI-6120
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jianhui Dong
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0, 1.0.0
>
>
> Check the code snippet of 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView#fetchAllLogsMergedFileSlice:
> {code:java}
> private Option<FileSlice> fetchAllLogsMergedFileSlice(HoodieFileGroup fileGroup, String maxInstantTime) {
>   List<FileSlice> fileSlices = fileGroup.getAllFileSlicesBeforeOn(maxInstantTime).collect(Collectors.toList());
>   if (fileSlices.size() == 0) {
>     return Option.empty();
>   }
>   if (fileSlices.size() == 1) {
>     return Option.of(fileSlices.get(0));
>   }
>   final FileSlice latestSlice = fileSlices.get(0);
>   FileSlice merged = new FileSlice(latestSlice.getPartitionPath(), latestSlice.getBaseInstantTime(),
>       latestSlice.getFileId());
>   // add log files from the latest slice to the earliest
>   fileSlices.forEach(slice -> slice.getLogFiles().forEach(merged::addLogFile));
>   return Option.of(merged);
> }{code}
> If we fetch only one file slice, we return that slice with its base file, 
> and then hudi-flink creates a SkipMergeIterator/MergeIterator which reads 
> both the base file and the log files for the split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6120) fetchAllLogsMergedFileSlice will read basefile which it does not expect

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6120:

Fix Version/s: 0.14.0

> fetchAllLogsMergedFileSlice will read basefile which it does not expect
> ---
>
> Key: HUDI-6120
> URL: https://issues.apache.org/jira/browse/HUDI-6120
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jianhui Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Check the code snippet of 
> org.apache.hudi.common.table.view.AbstractTableFileSystemView#fetchAllLogsMergedFileSlice:
> {code:java}
> private Option<FileSlice> fetchAllLogsMergedFileSlice(HoodieFileGroup fileGroup, String maxInstantTime) {
>   List<FileSlice> fileSlices = fileGroup.getAllFileSlicesBeforeOn(maxInstantTime).collect(Collectors.toList());
>   if (fileSlices.size() == 0) {
>     return Option.empty();
>   }
>   if (fileSlices.size() == 1) {
>     return Option.of(fileSlices.get(0));
>   }
>   final FileSlice latestSlice = fileSlices.get(0);
>   FileSlice merged = new FileSlice(latestSlice.getPartitionPath(), latestSlice.getBaseInstantTime(),
>       latestSlice.getFileId());
>   // add log files from the latest slice to the earliest
>   fileSlices.forEach(slice -> slice.getLogFiles().forEach(merged::addLogFile));
>   return Option.of(merged);
> }{code}
> If we fetch only one file slice, we return that slice with its base file, 
> and then hudi-flink creates a SkipMergeIterator/MergeIterator which reads 
> both the base file and the log files for the split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6121:
---

Assignee: (was: Ethan Guo)

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Right now the Kafka callback does not log the exception, and it would be 
> hard to find out why sending the Kafka message failed without that log.
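
A minimal sketch of the proposed behavior, assuming the standard Kafka producer callback API (class and topic names here are illustrative): log the exception inside the send callback instead of discarding it.

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CommitCallbackSketch {
  private static final Logger LOG = LoggerFactory.getLogger(CommitCallbackSketch.class);

  static void send(KafkaProducer<String, String> producer, String topic, String payload) {
    producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
      if (exception != null) {
        // Without this log line, a failed callback message is invisible.
        LOG.error("Failed to send Hudi commit callback message to topic {}", topic, exception);
      }
    });
  }
}
{code}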



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6121:

Status: In Progress  (was: Open)

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Priority: Minor
>  Labels: pull-request-available
>
> Right now the Kafka callback does not log the exception, and it would be 
> hard to find out why sending the Kafka message failed without that log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6121:

Fix Version/s: 0.14.0

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Right now the Kafka callback does not log the exception, and it would be 
> hard to find out why sending the Kafka message failed without that log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6121:
---

Assignee: Ethan Guo

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Assignee: Ethan Guo
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Right now the Kafka callback does not log the exception, and it would be 
> hard to find out why sending the Kafka message failed without that log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6121) Log the exception in the hudi commit kafka callback

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6121:

Status: Patch Available  (was: In Progress)

> Log the exception in the hudi commit kafka callback
> ---
>
> Key: HUDI-6121
> URL: https://issues.apache.org/jira/browse/HUDI-6121
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: ziqiao
>Priority: Minor
>  Labels: pull-request-available
>
> Right now the Kafka callback does not log the exception, and it would be 
> hard to find out why sending the Kafka message failed without that log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6122:

Fix Version/s: 0.14.0

> Call clean/compaction support custom options
> 
>
> Key: HUDI-6122
> URL: https://issues.apache.org/jira/browse/HUDI-6122
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6122:

Status: In Progress  (was: Open)

> Call clean/compaction support custom options
> 
>
> Key: HUDI-6122
> URL: https://issues.apache.org/jira/browse/HUDI-6122
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6122:

Component/s: spark-sql

> Call clean/compaction support custom options
> 
>
> Key: HUDI-6122
> URL: https://issues.apache.org/jira/browse/HUDI-6122
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6122) Call clean/compaction support custom options

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6122:

Status: Patch Available  (was: In Progress)

> Call clean/compaction support custom options
> 
>
> Key: HUDI-6122
> URL: https://issues.apache.org/jira/browse/HUDI-6122
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6124) Optimize exception message for init HoodieCatalogTable assert

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6124.
---
Resolution: Fixed

> Optimize exception message for init HoodieCatalogTable assert
> -
>
> Key: HUDI-6124
> URL: https://issues.apache.org/jira/browse/HUDI-6124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6124) Optimize exception message for init HoodieCatalogTable assert

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6124:

Fix Version/s: 0.14.0

> Optimize exception message for init HoodieCatalogTable assert
> -
>
> Key: HUDI-6124
> URL: https://issues.apache.org/jira/browse/HUDI-6124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8622: [HUDI-6163] Add PR size labeler

2023-05-02 Thread via GitHub


hudi-bot commented on PR #8622:
URL: https://github.com/apache/hudi/pull/8622#issuecomment-1532079334

   
   ## CI report:
   
   * fed3fee57fb882abd972995f24182b8598b6c576 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16797)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6125) Disable test testInsertDatasetWIthTimelineTimezoneUTC

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-6125.
---
Resolution: Invalid

> Disable test testInsertDatasetWIthTimelineTimezoneUTC
> -
>
> Key: HUDI-6125
> URL: https://issues.apache.org/jira/browse/HUDI-6125
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: Screenshot 2023-04-22 at 14.39.01.png
>
>
> The test testInsertDatasetWIthTimelineTimezoneUTC causes the GH Java CI to 
> time out for Spark tests. We need to disable this test to unblock CI while 
> investigating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6125) Disable test testInsertDatasetWIthTimelineTimezoneUTC

2023-05-02 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718707#comment-17718707
 ] 

Ethan Guo commented on HUDI-6125:
-

This is no longer needed as the offending commit is reverted 
(https://github.com/apache/hudi/pull/8549).

> Disable test testInsertDatasetWIthTimelineTimezoneUTC
> -
>
> Key: HUDI-6125
> URL: https://issues.apache.org/jira/browse/HUDI-6125
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.14.0
>
> Attachments: Screenshot 2023-04-22 at 14.39.01.png
>
>
> The test testInsertDatasetWIthTimelineTimezoneUTC causes the GH Java CI to 
> time out for Spark tests. We need to disable this test to unblock CI while 
> investigating.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6124) Optimize exception message for init HoodieCatalogTable assert

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6124:

Status: Patch Available  (was: In Progress)

> Optimize exception message for init HoodieCatalogTable assert
> -
>
> Key: HUDI-6124
> URL: https://issues.apache.org/jira/browse/HUDI-6124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6124) Optimize exception message for init HoodieCatalogTable assert

2023-05-02 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6124:

Status: In Progress  (was: Open)

> Optimize exception message for init HoodieCatalogTable assert
> -
>
> Key: HUDI-6124
> URL: https://issues.apache.org/jira/browse/HUDI-6124
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: xiaoping.huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

