[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866540631


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieBucketIndex.java:
##
@@ -37,28 +36,30 @@
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 
-import java.util.HashMap;
-import java.util.Map;
+import java.util.List;
 
 /**
  * Hash indexing mechanism.
  */
-public class HoodieBucketIndex extends HoodieIndex {
+public abstract class HoodieBucketIndex extends HoodieIndex {
 
-  private static final Logger LOG =  LogManager.getLogger(HoodieBucketIndex.class);
+  private static final Logger LOG = LogManager.getLogger(HoodieBucketIndex.class);
 
-  private final int numBuckets;
+  protected final int numBuckets;
+  protected final String indexKeyFields;

Review Comment:
   Nice point! Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866540796


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieSimpleBucketIndex.java:
##
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bucket;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndexUtils;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Simple bucket index implementation, with fixed bucket number.
+ */
+public class HoodieSimpleBucketIndex extends HoodieBucketIndex {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieSimpleBucketIndex.class);
+
+  /**
+   * Mapping from partitionPath -> bucketId -> fileInfo
+   */
+  private Map<String, Map<Integer, HoodieRecordLocation>> partitionPathFileIDList;
+
+  public HoodieSimpleBucketIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  private Map<Integer, HoodieRecordLocation> loadPartitionBucketIdFileIdMapping(
+      HoodieTable hoodieTable,
+      String partition) {
+    // bucketId -> fileIds
+    Map<Integer, HoodieRecordLocation> bucketIdToFileIdMapping = new HashMap<>();
+    hoodieTable.getMetaClient().reloadActiveTimeline();
+    HoodieIndexUtils
+        .getLatestBaseFilesForPartition(partition, hoodieTable)
+        .forEach(file -> {
+          String fileId = file.getFileId();
+          String commitTime = file.getCommitTime();
+          int bucketId = BucketIdentifier.bucketIdFromFileId(fileId);
+          if (!bucketIdToFileIdMapping.containsKey(bucketId)) {
+            bucketIdToFileIdMapping.put(bucketId, new HoodieRecordLocation(commitTime, fileId));
+          } else {
+            // Check if bucket data is valid
+            throw new HoodieIOException("Find multiple files at partition path="
+                + partition + " belongs to the same bucket id = " + bucketId);
+          }
+        });
+    return bucketIdToFileIdMapping;
+  }
+
+  @Override
+  public boolean canIndexLogFiles() {
+    return false;
+  }
+
+  @Override
+  protected void initialize(HoodieTable table, List<String> partitions) {
+    partitionPathFileIDList = new HashMap<>();
+    partitions.forEach(p -> partitionPathFileIDList.put(p, loadPartitionBucketIdFileIdMapping(table, p)));

Review Comment:
   Yes! Fixed.
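
   For readers following along, a tiny standalone sketch of the file-id-to-bucket mapping built in the diff above; it assumes the simple bucket index encodes the bucket number as a zero-padded prefix of the file group id (the real parsing lives in `BucketIdentifier.bucketIdFromFileId`, so treat this as an illustration only):

   ```java
   public class BucketIdFromFileIdExample {
     // Hypothetical parser mirroring the assumed prefix convention: the first 8 characters
     // of the file group id encode the zero-padded bucket number.
     static int bucketIdFromFileId(String fileId) {
       return Integer.parseInt(fileId.substring(0, 8));
     }

     public static void main(String[] args) {
       // Made-up file group id for illustration only.
       String fileId = "00000005-29cd-4c4b-bd88-6d6004ed765a";
       System.out.println(bucketIdFromFileId(fileId)); // prints 5
     }
   }
   ```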



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866540309


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java:
##
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bucket;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.data.HoodieData;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Predicate;
+
+/**
+ * Consistent hashing bucket index implementation, with auto-adjusted bucket number.
+ * NOTE: bucket resizing is triggered by clustering.
+ */
+public class HoodieSparkConsistentBucketIndex extends HoodieBucketIndex {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieSparkConsistentBucketIndex.class);
+
+  private Map<String, ConsistentBucketIdentifier> partitionToIdentifier;
+
+  public HoodieSparkConsistentBucketIndex(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public HoodieData<WriteStatus> updateLocation(HoodieData<WriteStatus> writeStatuses, HoodieEngineContext context, HoodieTable hoodieTable) throws HoodieIndexException {
+    return writeStatuses;
+  }
+
+  /**
+   * Do nothing.
+   * A failed write may create a hashing metadata for a partition. In this case, we still do nothing when rolling back
+   * the failed write, because the hashing metadata created by a writer must have the 00 timestamp and can be viewed
+   * as the initialization of a partition rather than as a part of the failed write.
+   *
+   * @param instantTime

Review Comment:
   Sure, will also check other comments across this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866540047


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieConsistentHashingMetadata.java:
##
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+
+import com.fasterxml.jackson.annotation.JsonAutoDetect;
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.PropertyAccessor;
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * All the metadata that is used for consistent hashing bucket index
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
+public class HoodieConsistentHashingMetadata implements Serializable {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieConsistentHashingMetadata.class);
+  /**
+   * Upper-bound of the hash value
+   */
+  public static final int HASH_VALUE_MASK = Integer.MAX_VALUE;
+  public static final String HASHING_METADATA_FILE_SUFFIX = ".hashing_meta";
+  private static final ObjectMapper MAPPER = new ObjectMapper();
+
+  static {
+    MAPPER.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
+    MAPPER.setVisibility(PropertyAccessor.FIELD, JsonAutoDetect.Visibility.ANY);
+  }
+
+  private final short version;
+  private final String partitionPath;
+  private final String instant;
+  private final int numBuckets;
+  private final int seqNo;
+  private final List<ConsistentHashingNode> nodes;
+
+  @JsonCreator
+  public HoodieConsistentHashingMetadata(@JsonProperty("version") short version, @JsonProperty("partitionPath") String partitionPath,
+                                         @JsonProperty("instant") String instant, @JsonProperty("numBuckets") int numBuckets,
+                                         @JsonProperty("seqNo") int seqNo, @JsonProperty("nodes") List<ConsistentHashingNode> nodes) {
+    this.version = version;
+    this.partitionPath = partitionPath;
+    this.instant = instant;
+    this.numBuckets = numBuckets;
+    this.seqNo = seqNo;
+    this.nodes = nodes;
+  }
+
+  public HoodieConsistentHashingMetadata(String partitionPath, int numBuckets) {
+    this((short) 0, partitionPath, HoodieTimeline.INIT_INSTANT_TS, numBuckets, 0);
+  }
+
+  /**
+   * Construct default metadata with all bucket's file group uuid initialized
+   *
+   * @param partitionPath
+   * @param numBuckets
+   */
+  private HoodieConsistentHashingMetadata(short version, String partitionPath, String instant, int numBuckets, int seqNo) {
+    this.version = version;
+    this.partitionPath = partitionPath;
+    this.instant = instant;
+    this.numBuckets = numBuckets;
+    this.seqNo = seqNo;
+
+    nodes = new ArrayList<>();
+    long step = ((long) HASH_VALUE_MASK + numBuckets - 1) / numBuckets;
+    for (int i = 1; i <= numBuckets; ++i) {

Review Comment:
   Could you elaborate a little bit more on this comment? I am not seeing two constructors duplicating each other here.
   
   P.S. The links seem unrelated to this PR.
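
   Purely as an illustration of the range-splitting formula quoted in the diff above (the loop body is truncated there, so the per-bucket upper bound computed here is only an assumption about what it produces):

   ```java
   public class HashRangeSplitExample {
     // Same upper bound as HASH_VALUE_MASK in the diff above.
     static final int HASH_VALUE_MASK = Integer.MAX_VALUE;

     public static void main(String[] args) {
       int numBuckets = 4;
       // Ceiling division, as in the diff: numBuckets steps of this size cover [0, HASH_VALUE_MASK].
       long step = ((long) HASH_VALUE_MASK + numBuckets - 1) / numBuckets;
       System.out.println("step = " + step); // 536870912 for 4 buckets
       for (int i = 1; i <= numBuckets; ++i) {
         // Hypothetical per-bucket boundary: the i-th node would cover hash values up to min(i * step, HASH_VALUE_MASK).
         long upperBound = Math.min(i * step, HASH_VALUE_MASK);
         System.out.println("bucket " + i + " upper bound = " + upperBound);
       }
     }
   }
   ```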



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866538016


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIdentifier.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bucket;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.hash.HashID;
+
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+
+public class ConsistentBucketIdentifier extends BucketIdentifier {
+
+  /**
+   * Hashing metadata of a partition
+   */
+  private final HoodieConsistentHashingMetadata metadata;
+  /**
+   * In-memory structure to speed up ring mapping (hashing value -> hashing node)
+   */
+  private final TreeMap<Integer, ConsistentHashingNode> ring;
+  /**
+   * Mapping from fileId -> hashing node
+   */
+  private final Map<String, ConsistentHashingNode> fileIdToBucket;
+
+  public ConsistentBucketIdentifier(HoodieConsistentHashingMetadata metadata) {
+this.metadata = metadata;
+this.fileIdToBucket = new HashMap<>();
+this.ring = new TreeMap<>();
+initialize();
+  }
+
+  public Collection<ConsistentHashingNode> getNodes() {
+return ring.values();
+  }
+
+  public HoodieConsistentHashingMetadata getMetadata() {
+return metadata;
+  }
+
+  public int getNumBuckets() {
+return getNodes().size();
+  }
+
+  /**
+   * Get bucket of the given file group
+   *
+   * @param fileId the file group id. NOTE: not filePfx (i.e., uuid)
+   * @return
+   */
+  public ConsistentHashingNode getBucketByFileId(String fileId) {
+return fileIdToBucket.get(fileId);
+  }
+
+  public ConsistentHashingNode getBucket(HoodieKey hoodieKey, String indexKeyFields) {
+    return getBucket(getHashKeys(hoodieKey, indexKeyFields));
+  }
+
+  protected ConsistentHashingNode getBucket(List<String> hashKeys) {
+    int hashValue = 0;
+    for (int i = 0; i < hashKeys.size(); ++i) {
+      hashValue = HashID.getXXHash32(hashKeys.get(i), hashValue);

Review Comment:
   I followed the standard List.hashCode() implementation, which makes a sequential hash-function call on each element. Concatenating the keys would also introduce additional cost (mainly memory).
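
   To illustrate the point above, a minimal standalone sketch of sequential hash combining. It uses a hypothetical combine function in place of the PR's HashID.getXXHash32, which takes the previous hash value as the seed:

   ```java
   import java.util.Arrays;
   import java.util.List;

   public class SequentialHashExample {
     // Hypothetical stand-in for HashID.getXXHash32(key, seed): fold each key into the running hash.
     static int combine(String key, int seed) {
       return 31 * seed + key.hashCode();
     }

     public static void main(String[] args) {
       List<String> hashKeys = Arrays.asList("uuid-123", "2022-05-05");
       int hashValue = 0;
       // Same shape as the loop in the diff: each key is hashed with the previous value as the seed,
       // so no concatenated string has to be materialized.
       for (String key : hashKeys) {
         hashValue = combine(key, hashValue);
       }
       System.out.println(hashValue);
     }
   }
   ```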



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866537070


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIdentifier.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bucket;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.hash.HashID;
+
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+
+public class ConsistentBucketIdentifier extends BucketIdentifier {
+
+  /**
+   * Hashing metadata of a partition
+   */
+  private final HoodieConsistentHashingMetadata metadata;
+  /**
+   * In-memory structure to speed up ring mapping (hashing value -> hashing node)
+   */
+  private final TreeMap<Integer, ConsistentHashingNode> ring;
+  /**
+   * Mapping from fileId -> hashing node
+   */
+  private final Map<String, ConsistentHashingNode> fileIdToBucket;
+
+  public ConsistentBucketIdentifier(HoodieConsistentHashingMetadata metadata) {
+this.metadata = metadata;
+this.fileIdToBucket = new HashMap<>();
+this.ring = new TreeMap<>();
+initialize();
+  }
+
+  public Collection<ConsistentHashingNode> getNodes() {
+return ring.values();
+  }
+
+  public HoodieConsistentHashingMetadata getMetadata() {
+return metadata;
+  }
+
+  public int getNumBuckets() {
+return getNodes().size();

Review Comment:
   Fixed.
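
   For context on the ring field discussed above, here is a generic consistent-hashing lookup sketch — not the PR's exact implementation — showing why a TreeMap speeds up the hash-value-to-node mapping; the node type and values are made up:

   ```java
   import java.util.Map;
   import java.util.TreeMap;

   public class RingLookupExample {
     // Hypothetical node type standing in for ConsistentHashingNode.
     record Node(String fileIdPrefix, int value) {}

     public static void main(String[] args) {
       TreeMap<Integer, Node> ring = new TreeMap<>();
       ring.put(500, new Node("bucket-a", 500));
       ring.put(1500, new Node("bucket-b", 1500));
       ring.put(3000, new Node("bucket-c", 3000));

       int hashValue = 2000;
       // Classic consistent-hashing lookup: the owning node is the first one whose
       // value is >= the hash; if none exists, wrap around to the first node on the ring.
       Map.Entry<Integer, Node> entry = ring.ceilingEntry(hashValue);
       Node owner = (entry != null) ? entry.getValue() : ring.firstEntry().getValue();
       System.out.println(owner.fileIdPrefix()); // bucket-c
     }
   }
   ```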



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866536775


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java:
##
@@ -22,41 +22,50 @@
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
 
+import java.io.Serializable;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.List;
 import java.util.Map;
 import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 
-public class BucketIdentifier {
-  // compatible with the spark bucket name
+public class BucketIdentifier implements Serializable {
+  // Compatible with the spark bucket name
   private static final Pattern BUCKET_NAME = Pattern.compile(".*_(\\d+)(?:\\..*)?$");
 
   public static int getBucketId(HoodieRecord record, String indexKeyFields, int numBuckets) {
     return getBucketId(record.getKey(), indexKeyFields, numBuckets);
   }
 
   public static int getBucketId(HoodieKey hoodieKey, String indexKeyFields, int numBuckets) {
-    return getBucketId(hoodieKey.getRecordKey(), indexKeyFields, numBuckets);
+    return (getHashKeys(hoodieKey, indexKeyFields).hashCode() & Integer.MAX_VALUE) % numBuckets;
   }
 
   public static int getBucketId(String recordKey, String indexKeyFields, int numBuckets) {
-    List<String> hashKeyFields;
+    return (getHashKeys(recordKey, indexKeyFields).hashCode() & Integer.MAX_VALUE) % numBuckets;
+  }
+
+  public static List<String> getHashKeys(HoodieKey hoodieKey, String indexKeyFields) {
+    return getHashKeys(hoodieKey.getRecordKey(), indexKeyFields);
+  }
+
+  protected static List<String> getHashKeys(String recordKey, String indexKeyFields) {
+    List<String> hashKeys;
     if (!recordKey.contains(":")) {
-      hashKeyFields = Collections.singletonList(recordKey);
+      hashKeys = Collections.singletonList(recordKey);
     } else {
       Map<String, String> recordKeyPairs = Arrays.stream(recordKey.split(","))
           .map(p -> p.split(":"))
           .collect(Collectors.toMap(p -> p[0], p -> p[1]));
-      hashKeyFields = Arrays.stream(indexKeyFields.split(","))
+      hashKeys = Arrays.stream(indexKeyFields.split(","))
           .map(f -> recordKeyPairs.get(f))
           .collect(Collectors.toList());
     }
-    return (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets;
+    return hashKeys;
   }
 
-  // only for test
+  // Only for test

Review Comment:
   The code is from the original simple bucket index implementation. I just 
re-organized the code to remove this test-only function.
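
   As a self-contained illustration of the key parsing and bucket formula shown in the diff above (the record key and field names below are made up):

   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   public class HashKeyExample {
     public static void main(String[] args) {
       // Composite record key in the "field:value" form handled by getHashKeys in the diff.
       String recordKey = "id:42,name:alice";
       String indexKeyFields = "id";
       int numBuckets = 8;

       // Split "field:value" pairs and keep only the configured index key fields.
       Map<String, String> recordKeyPairs = Arrays.stream(recordKey.split(","))
           .map(p -> p.split(":"))
           .collect(Collectors.toMap(p -> p[0], p -> p[1]));
       List<String> hashKeys = Arrays.stream(indexKeyFields.split(","))
           .map(recordKeyPairs::get)
           .collect(Collectors.toList());

       // Same bucket formula as the diff: non-negative hash modulo the bucket count.
       int bucketId = (hashKeys.hashCode() & Integer.MAX_VALUE) % numBuckets;
       System.out.println(hashKeys + " -> bucket " + bucketId);
     }
   }
   ```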



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YuweiXiao commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


YuweiXiao commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866536047


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/LazyIterableIterator.java:
##
@@ -45,7 +45,9 @@ public LazyIterableIterator(Iterator<I> in) {
   /**
    * Called once, before any elements are processed.
    */
-  protected abstract void start();
+  protected void start() {
Review Comment:
   Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] jtmzheng opened a new issue, #5514: [SUPPORT] Read optimized query on MOR table lists files without any Spark action

2022-05-05 Thread GitBox


jtmzheng opened a new issue, #5514:
URL: https://github.com/apache/hudi/issues/5514

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   I'm seeing some unexpected behavior where a `read_optimized` Spark query on a MOR table takes ~30 minutes without any action (this is on Hudi 0.9.0 without the metadata table enabled):
   ```
   start_time = datetime.now()
   read_options = {"hoodie.datasource.query.type": "read_optimized"}
   df = (
   spark.read.format("hudi")
   .options(**read_options)
   .load("{table_s3_path}")
   )
   print(f"Elapsed: {datetime.now() - start_time}")
   ```
   
   ```
   Elapsed: 0:34:38.293859
   ```
   
   A snapshot query returns in ~ 5s (as expected) since there is no action like 
count, collect, show, etc. This also doesn't seem to affect COW tables.
   
   Curiously, looking at the Spark UI showed jobs being created referencing https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java#L73.
   
   I got help from a user on Hudi Slack: 
https://apache-hudi.slack.com/archives/C4D716NPQ/p1651784954682329 who pointed 
to:
   
   ```
   int parallelism = Math.min(DEFAULT_LISTING_PARALLELISM, partitionPaths.size());
   
   List<Pair<String, FileStatus[]>> partitionToFiles = engineContext.map(partitionPaths, partitionPathStr -> {
     Path partitionPath = new Path(partitionPathStr);
     FileSystem fs = partitionPath.getFileSystem(hadoopConf.get());
     return Pair.of(partitionPathStr, FSUtils.getAllDataFilesInPartition(fs, partitionPath));
   }, parallelism);
   ```
   
   being the culprit where the read optimized query was listing the files in 
the table (there are a lot of files so it's not surprising this takes a while 
since it's not doing any partition pruning). Link: 
https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java#L119
 
   
   Can anyone provide insight on what's going on? What can I do to work around 
this?
   
   
   Steps to reproduce the behavior:
   
   1. Create a MOR table with some test data
   2. Query the table through Spark using a read optimized query **without** 
any action
   3. Verify Spark jobs are created that listed the files through the Spark UI
   
   **Expected behavior**
   
   The read optimized query does not list the files until an action (eg. if you 
query a specific partition it should only list the files in that partition).
   
   **Environment Description**
   
   * Hudi version : 0.9.0 (EMR)
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   N/A
   
   **Stacktrace**
   
   N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] aiwenmo commented on issue #5513: [SUPPORT] Sync realtime whole mysql database to hudi failed when using flink datastream api

2022-05-05 Thread GitBox


aiwenmo commented on issue #5513:
URL: https://github.com/apache/hudi/issues/5513#issuecomment-1119280273

   > You need to set up the key generator clazz correctly.
   
   thx. Your method is also OK.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-05-05 Thread GitBox


hudi-bot commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1119276503

   
   ## CI report:
   
   * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN
   * b7ed5a5b237814ee2f0266b0ec1345f23d69d94a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8425)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-05-05 Thread GitBox


hudi-bot commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1119275155

   
   ## CI report:
   
   * 8c6f6e19940ce7ac04dfcfce52da3ccdaf3a8b0f UNKNOWN
   * b7ed5a5b237814ee2f0266b0ec1345f23d69d94a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8425)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #5513: [SUPPORT] Sync realtime whole mysql database to hudi failed when using flink datastream api

2022-05-05 Thread GitBox


danny0405 commented on issue #5513:
URL: https://github.com/apache/hudi/issues/5513#issuecomment-1119274099

   You need to set up the key generator clazz correctly.
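
   For anyone hitting the same error, a minimal sketch of what setting the key generator class could look like on the Flink write configuration. The option key is the standard Hudi write option; the generator class shown here is only an example and must match the table's actual record key layout:

   ```java
   import org.apache.flink.configuration.Configuration;

   public class KeyGeneratorConfigExample {
     public static void main(String[] args) {
       Configuration configuration = new Configuration();
       // Assumption: point Hudi at a key generator that matches the record key fields;
       // ComplexAvroKeyGenerator is just one possible choice, not necessarily the right one here.
       configuration.setString("hoodie.datasource.write.keygenerator.class",
           "org.apache.hudi.keygen.ComplexAvroKeyGenerator");
       System.out.println(configuration);
     }
   }
   ```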


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #5402: [WIP] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default

2022-05-05 Thread GitBox


yihua commented on PR #5402:
URL: https://github.com/apache/hudi/pull/5402#issuecomment-1119273974

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #5382: [SUPPORT] org.apache.hudi.hadoop.hive.HoodieCombineRealtimeFileSplit cannot be cast to org.apache.hadoop.hive.shims.HadoopShimsSecure$InputSplitShim

2022-05-05 Thread GitBox


danny0405 commented on issue #5382:
URL: https://github.com/apache/hudi/issues/5382#issuecomment-1119273063

   Can you ask for help in the DingTalk group? Some people might have solved this problem already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-05-05 Thread GitBox


danny0405 commented on code in PR #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r866499815


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1301,4 +1359,33 @@ public void close() {
 this.heartbeatClient.stop();
 this.txnManager.close();
   }
+
+  private void setWriteTimer(HoodieTable table) {
+    String commitType = table.getMetaClient().getCommitActionType();
+    if (commitType.equals(HoodieTimeline.COMMIT_ACTION)) {
+      writeTimer = metrics.getCommitCtx();
+    } else if (commitType.equals(HoodieTimeline.DELTA_COMMIT_ACTION)) {
+      writeTimer = metrics.getDeltaCommitCtx();
+    }
+  }
+
+  private void tryUpgrade(HoodieTableMetaClient metaClient, Option<String> instantTime) {
+    UpgradeDowngrade upgradeDowngrade =
+        new UpgradeDowngrade(metaClient, config, context, upgradeDowngradeHelper);
+
+    if (upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {
+      // Ensure no inflight commits by setting EAGER policy and explicitly cleaning all failed commits
+      List<String> instantsToRollback = getInstantsToRollback(metaClient, HoodieFailedWritesCleaningPolicy.EAGER, instantTime);
+
+      Map<String, Option<HoodiePendingRollbackInfo>> pendingRollbacks = getPendingRollbackInfos(metaClient);

Review Comment:
   > sure table is in a consistent state (no leftovers of failed commits) when 
we start the upgrade process.
   
   My confusion is: why do we need to do that for the upgrade?
   Is there any restriction here for correctness? The code before the patch did not do so.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] parisni commented on issue #5484: [SUPPORT] Hive Sync + AWS Data Catalog failling with Hudi 0.11.0

2022-05-05 Thread GitBox


parisni commented on issue #5484:
URL: https://github.com/apache/hudi/issues/5484#issuecomment-1119271257

   What about removing that `'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool'`? This will work as in 0.10 with the regular hive sync connector, which just works with Glue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5509: [HUDI-4041] compact with precombineKey in RealtimeCompactedRecordRead…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5509:
URL: https://github.com/apache/hudi/pull/5509#issuecomment-1119255595

   
   ## CI report:
   
   * 85be21c2ec99670eb9e0e697259e407d78bf4524 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8461)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5509: [HUDI-4041] compact with precombineKey in RealtimeCompactedRecordRead…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5509:
URL: https://github.com/apache/hudi/pull/5509#issuecomment-1119253350

   
   ## CI report:
   
   * 99e207f779a258ff32c57c9d1b962d772213c081 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8448)
 
   * 85be21c2ec99670eb9e0e697259e407d78bf4524 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8461)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5509: [HUDI-4041] compact with precombineKey in RealtimeCompactedRecordRead…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5509:
URL: https://github.com/apache/hudi/pull/5509#issuecomment-1119252227

   
   ## CI report:
   
   * 99e207f779a258ff32c57c9d1b962d772213c081 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8448)
 
   * 85be21c2ec99670eb9e0e697259e407d78bf4524 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5073: [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode

2022-05-05 Thread GitBox


hudi-bot commented on PR #5073:
URL: https://github.com/apache/hudi/pull/5073#issuecomment-1119208823

   
   ## CI report:
   
   * a1322fbeb11fe5bb71cd5d70f13147bf8a036996 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8460)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] aiwenmo opened a new issue, #5513: [SUPPORT] Sync realtime whole mysql database to hudi failed when using flink datastream api

2022-05-05 Thread GitBox


aiwenmo opened a new issue, #5513:
URL: https://github.com/apache/hudi/issues/5513

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   org.apache.hudi.org.apache.avro.AvroRuntimeException: Not a valid schema 
field.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Some relevant code is shown below.
   
   
   `
   public void addSink(
   StreamExecutionEnvironment env,
    DataStream<RowData> rowDataDataStream,
    Table table,
    List<String> columnNameList,
    List<LogicalType> columnTypeList) {
   
    final String[] columnNames = columnNameList.toArray(new String[columnNameList.size()]);
    final LogicalType[] columnTypes = columnTypeList.toArray(new LogicalType[columnTypeList.size()]);
   
   final String tableName = getSinkTableName(table);
   
   Integer parallelism = 1;
   boolean isMor = true;
    Map<String, String> sink = config.getSink();
   Configuration configuration = Configuration.fromMap(sink);
   if (sink.containsKey("parallelism")) {
   parallelism = Integer.valueOf(sink.get("parallelism"));
   }
   if (configuration.contains(FlinkOptions.PATH)) {
    configuration.set(FlinkOptions.PATH, configuration.getValue(FlinkOptions.PATH) + tableName);
   }
   if (sink.containsKey(FlinkOptions.TABLE_TYPE.key())) {
    isMor = HoodieTableType.MERGE_ON_READ.name().equals(sink.get(FlinkOptions.TABLE_TYPE.key()));
   }
   configuration.set(FlinkOptions.TABLE_NAME, tableName);
    configuration.set(FlinkOptions.HIVE_SYNC_DB, getSinkSchemaName(table));
   configuration.set(FlinkOptions.HIVE_SYNC_TABLE, tableName);
   
    long ckpTimeout = rowDataDataStream.getExecutionEnvironment()
        .getCheckpointConfig().getCheckpointTimeout();
    configuration.setLong(FlinkOptions.WRITE_COMMIT_ACK_TIMEOUT, ckpTimeout);
   
   RowType rowType = RowType.of(false, columnTypes, columnNames);
   configuration.setString(FlinkOptions.SOURCE_AVRO_SCHEMA,
   AvroSchemaConverter.convertToSchema(rowType).toString());
   
   // bulk_insert mode
    final String writeOperation = configuration.get(FlinkOptions.OPERATION);
    if (WriteOperationType.fromValue(writeOperation) == WriteOperationType.BULK_INSERT) {
   Pipelines.bulkInsert(configuration, rowType, rowDataDataStream);
   } else
   // Append mode
   if (OptionsResolver.isAppendMode(configuration)) {
   Pipelines.append(configuration, rowType, rowDataDataStream);
   } else {
   
    DataStream<HoodieRecord> hoodieRecordDataStream = Pipelines.bootstrap(configuration, rowType, parallelism, rowDataDataStream);
    DataStream<Object> pipeline = Pipelines.hoodieStreamWrite(configuration, parallelism, hoodieRecordDataStream);
   
   // compaction
   if (StreamerUtil.needsAsyncCompaction(configuration)) {
   Pipelines.compact(configuration, pipeline);
   } else {
   Pipelines.clean(configuration, pipeline);
   }
   if (isMor) {
   Pipelines.clean(configuration, pipeline);
   Pipelines.compact(configuration, pipeline);
   }
   }
   }
   `
   2. Use dlink to submit the task. The SQL is as follows.
   
   ` sql
   
   EXECUTE CDCSOURCE demo WITH (
   'connector' = 'mysql-cdc',
   'hostname' = '127.0.0.1',
   'port' = '3306',
   'username' = 'root',
   'password' = '123456',
   'source.server-time-zone' = 'UTC',
   'checkpoint'='1000',
   'scan.startup.mode'='initial',
   'parallelism'='1',
   'database-name'='data_deal',
   'table-name'='data_deal\.stu,data_deal\.score',
   'sink.connector'='datastream-hudi',
   'sink.path'='hdfs://cluster1/tmp/flink/cdcdata/',
   'sink.hoodie.datasource.write.recordkey.field'='id',
   'sink.hoodie.parquet.max.file.size'='268435456',
   'sink.write.precombine.field'='update_time',
   'sink.write.tasks'='1',
   'sink.write.bucket_assign.tasks'='2',
   'sink.write.precombine'='true',
   'sink.compaction.async.enabled'='true',
   'sink.write.task.max.size'='1024',
   'sink.write.rate.limit'='3000',
   'sink.write.operation'='upsert',
   'sink.table.type'='COPY_ON_WRITE',
   'sink.compaction.tasks'='1',
   'sink.compaction.delta_seconds'='20',
   'sink.compaction.async.enabled'='true',
   'sink.read.streaming.skip_compaction'='true',
   'sink.compaction.delta_commits'='20',
  

[GitHub] [hudi] hudi-bot commented on pull request #5073: [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode

2022-05-05 Thread GitBox


hudi-bot commented on PR #5073:
URL: https://github.com/apache/hudi/pull/5073#issuecomment-1119186216

   
   ## CI report:
   
   * 7042acc09b38acd8741c89ca77e99bdacaa6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8459)
 
   * a1322fbeb11fe5bb71cd5d70f13147bf8a036996 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8460)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5073: [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode

2022-05-05 Thread GitBox


hudi-bot commented on PR #5073:
URL: https://github.com/apache/hudi/pull/5073#issuecomment-1119184975

   
   ## CI report:
   
   * 7042acc09b38acd8741c89ca77e99bdacaa6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8459)
 
   * a1322fbeb11fe5bb71cd5d70f13147bf8a036996 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] lanyuanxiaoyao commented on a diff in pull request #5473: [HUDI-4003] Try to read all the log file to parse schema

2022-05-05 Thread GitBox


lanyuanxiaoyao commented on code in PR #5473:
URL: https://github.com/apache/hudi/pull/5473#discussion_r866425592


##
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java:
##
@@ -109,13 +110,18 @@ private MessageType getTableParquetSchemaFromDataFile() {
       // Determine the file format based on the file name, and then extract schema from it.
       if (instantAndCommitMetadata.isPresent()) {
         HoodieCommitMetadata commitMetadata = instantAndCommitMetadata.get().getRight();
-        String filePath = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
-        if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
-          // this is a log file
-          return readSchemaFromLogFile(new Path(filePath));
-        } else {
-          return readSchemaFromBaseFile(filePath);
+        Iterator<String> filePaths = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
+        MessageType type = null;
+        while (filePaths.hasNext() && type == null) {

Review Comment:
   Ok. Thanks for reviewing, and the new commit has fixed it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4048) Upgrade Hudi version in presto-hive

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4048:
--
Status: In Progress  (was: Open)

> Upgrade Hudi version in presto-hive
> ---
>
> Key: HUDI-4048
> URL: https://issues.apache.org/jira/browse/HUDI-4048
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4048) Upgrade Hudi version in presto-hive

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4048:
--
Status: Patch Available  (was: In Progress)

> Upgrade Hudi version in presto-hive
> ---
>
> Key: HUDI-4048
> URL: https://issues.apache.org/jira/browse/HUDI-4048
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3960) Update HudiRealtimeSplitConverter to correctly instantiate HoodieRealtimeFileSplit

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3960:
--
Status: Patch Available  (was: In Progress)

> Update HudiRealtimeSplitConverter to correctly instantiate 
> HoodieRealtimeFileSplit
> --
>
> Key: HUDI-3960
> URL: https://issues.apache.org/jira/browse/HUDI-3960
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> The constructor for HoodieRealtimeFileSplit changed due to 
> [https://github.com/apache/hudi/pull/4743/files#diff-22bc252cee3012a8afa882f0dab4304dc6892a950a44af518395d678dd413330]
> HudiRealtimeSplitConverter in Presto relies on HoodieRealtimeFileSplit to 
> fetch delta log files.
> After the release of Hudi 0.11, we need to upgrade hudi-presto-bundle in 
> presto. If not, snapshot queries on MOR table will break due to
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit.<init>(Lorg/apache/hadoop/mapred/FileSplit;Ljava/lang/String;Ljava/util/List;Ljava/lang/String;Lorg/apache/hudi/common/util/Option;)V
>   at 
> com.facebook.presto.hive.util.HudiRealtimeSplitConverter.recreateFileSplitWithCustomInfo(HudiRealtimeSplitConverter.java:70)
>   at 
> com.facebook.presto.hive.util.CustomSplitConversionUtils.recreateSplitWithCustomInfo(CustomSplitConversionUtils.java:58)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:247)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3960) Update HudiRealtimeSplitConverter to correctly instantiate HoodieRealtimeFileSplit

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3960:
--
Status: In Progress  (was: Open)

> Update HudiRealtimeSplitConverter to correctly instantiate 
> HoodieRealtimeFileSplit
> --
>
> Key: HUDI-3960
> URL: https://issues.apache.org/jira/browse/HUDI-3960
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> The constructor for HoodieRealtimeFileSplit changed due to 
> [https://github.com/apache/hudi/pull/4743/files#diff-22bc252cee3012a8afa882f0dab4304dc6892a950a44af518395d678dd413330]
> HudiRealtimeSplitConverter in Presto relies on HoodieRealtimeFileSplit to 
> fetch delta log files.
> After the release of Hudi 0.11, we need to upgrade hudi-presto-bundle in 
> presto. If not, snapshot queries on MOR table will break due to
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit.<init>(Lorg/apache/hadoop/mapred/FileSplit;Ljava/lang/String;Ljava/util/List;Ljava/lang/String;Lorg/apache/hudi/common/util/Option;)V
>   at 
> com.facebook.presto.hive.util.HudiRealtimeSplitConverter.recreateFileSplitWithCustomInfo(HudiRealtimeSplitConverter.java:70)
>   at 
> com.facebook.presto.hive.util.CustomSplitConversionUtils.recreateSplitWithCustomInfo(CustomSplitConversionUtils.java:58)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:247)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HUDI-4031) Avoid clustering update handling when clustering is disabled

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-4031.
---

> Avoid clustering update handling when clustering is disabled
> 
>
> Key: HUDI-4031
> URL: https://issues.apache.org/jira/browse/HUDI-4031
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> We call distinct().collectAsList() on RDD to determine conflicting filegroups 
> while handling updates with clustering. See 
> [https://github.com/apache/hudi/blob/6af1ff7a663da57438db8847ca0dfda5a6e381f5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/update/strategy/BaseSparkUpdateStrategy.java#L50]
>  
> While this is needed when clustering is enabled with regular writer, it can 
> be avoided when clustering is disabled and there are no pending 
> replacecommits.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3960) Update HudiRealtimeSplitConverter to correctly instantiate HoodieRealtimeFileSplit

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3960:
--
Story Points: 1

> Update HudiRealtimeSplitConverter to correctly instantiate 
> HoodieRealtimeFileSplit
> --
>
> Key: HUDI-3960
> URL: https://issues.apache.org/jira/browse/HUDI-3960
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> The constructor for HoodieRealtimeFileSplit changed due to 
> [https://github.com/apache/hudi/pull/4743/files#diff-22bc252cee3012a8afa882f0dab4304dc6892a950a44af518395d678dd413330]
> HudiRealtimeSplitConverter in Presto relies on HoodieRealtimeFileSplit to 
> fetch delta log files.
> After the release of Hudi 0.11, we need to upgrade hudi-presto-bundle in 
> presto. If not, snapshot queries on MOR table will break due to
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit.<init>(Lorg/apache/hadoop/mapred/FileSplit;Ljava/lang/String;Ljava/util/List;Ljava/lang/String;Lorg/apache/hudi/common/util/Option;)V
>   at 
> com.facebook.presto.hive.util.HudiRealtimeSplitConverter.recreateFileSplitWithCustomInfo(HudiRealtimeSplitConverter.java:70)
>   at 
> com.facebook.presto.hive.util.CustomSplitConversionUtils.recreateSplitWithCustomInfo(CustomSplitConversionUtils.java:58)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:247)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4050) Upgrade Hudi version to 0.11.0 in the connector

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4050:
--
Sprint: 2022/05/02

> Upgrade Hudi version to 0.11.0 in the connector
> ---
>
> Key: HUDI-4050
> URL: https://issues.apache.org/jira/browse/HUDI-4050
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4050) Upgrade Hudi version to 0.11.0 in the connector

2022-05-05 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-4050:
-

 Summary: Upgrade Hudi version to 0.11.0 in the connector
 Key: HUDI-4050
 URL: https://issues.apache.org/jira/browse/HUDI-4050
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit
Assignee: Sagar Sumit






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4049) Upgrade Hudi version in the connector

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4049:
--
Sprint: 2022/05/02

> Upgrade Hudi version in the connector
> -
>
> Key: HUDI-4049
> URL: https://issues.apache.org/jira/browse/HUDI-4049
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4048) Upgrade Hudi version in presto-hive

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4048:
--
Sprint: 2022/05/02

> Upgrade Hudi version in presto-hive
> ---
>
> Key: HUDI-4048
> URL: https://issues.apache.org/jira/browse/HUDI-4048
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3960) Update HudiRealtimeSplitConverter to correctly instantiate HoodieRealtimeFileSplit

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-3960:
--
Sprint: 2022/05/02

> Update HudiRealtimeSplitConverter to correctly instantiate 
> HoodieRealtimeFileSplit
> --
>
> Key: HUDI-3960
> URL: https://issues.apache.org/jira/browse/HUDI-3960
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> The constructor for HoodieRealtimeFileSplit changed due to 
> [https://github.com/apache/hudi/pull/4743/files#diff-22bc252cee3012a8afa882f0dab4304dc6892a950a44af518395d678dd413330]
> HudiRealtimeSplitConverter in Presto relies on HoodieRealtimeFileSplit to 
> fetch delta log files.
> After the release of Hudi 0.11, we need to upgrade hudi-presto-bundle in 
> Presto. If not, snapshot queries on MOR tables will break due to
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit.<init>(Lorg/apache/hadoop/mapred/FileSplit;Ljava/lang/String;Ljava/util/List;Ljava/lang/String;Lorg/apache/hudi/common/util/Option;)V
>   at 
> com.facebook.presto.hive.util.HudiRealtimeSplitConverter.recreateFileSplitWithCustomInfo(HudiRealtimeSplitConverter.java:70)
>   at 
> com.facebook.presto.hive.util.CustomSplitConversionUtils.recreateSplitWithCustomInfo(CustomSplitConversionUtils.java:58)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:247)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4048) Upgrade Hudi version in presto-hive

2022-05-05 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-4048:
-

 Summary: Upgrade Hudi version in presto-hive
 Key: HUDI-4048
 URL: https://issues.apache.org/jira/browse/HUDI-4048
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit
Assignee: Sagar Sumit






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4049) Upgrade Hudi version in the connector

2022-05-05 Thread Sagar Sumit (Jira)
Sagar Sumit created HUDI-4049:
-

 Summary: Upgrade Hudi version in the connector
 Key: HUDI-4049
 URL: https://issues.apache.org/jira/browse/HUDI-4049
 Project: Apache Hudi
  Issue Type: Task
Reporter: Sagar Sumit
Assignee: Sagar Sumit






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-3960) Update HudiRealtimeSplitConverter to correctly instantiate HoodieRealtimeFileSplit

2022-05-05 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-3960:
-

Assignee: Sagar Sumit

> Update HudiRealtimeSplitConverter to correctly instantiate 
> HoodieRealtimeFileSplit
> --
>
> Key: HUDI-3960
> URL: https://issues.apache.org/jira/browse/HUDI-3960
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Critical
> Fix For: 0.11.1
>
>
> The constructor for HoodieRealtimeFileSplit changed due to 
> [https://github.com/apache/hudi/pull/4743/files#diff-22bc252cee3012a8afa882f0dab4304dc6892a950a44af518395d678dd413330]
> HudiRealtimeSplitConverter in Presto relies on HoodieRealtimeFileSplit to 
> fetch delta log files.
> After the release of Hudi 0.11, we need to upgrade hudi-presto-bundle in 
> Presto. If not, snapshot queries on MOR tables will break due to
> {code:java}
> java.lang.NoSuchMethodError: 
> org.apache.hudi.hadoop.realtime.HoodieRealtimeFileSplit.<init>(Lorg/apache/hadoop/mapred/FileSplit;Ljava/lang/String;Ljava/util/List;Ljava/lang/String;Lorg/apache/hudi/common/util/Option;)V
>   at 
> com.facebook.presto.hive.util.HudiRealtimeSplitConverter.recreateFileSplitWithCustomInfo(HudiRealtimeSplitConverter.java:70)
>   at 
> com.facebook.presto.hive.util.CustomSplitConversionUtils.recreateSplitWithCustomInfo(CustomSplitConversionUtils.java:58)
>   at 
> com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:247)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>   at 
> com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5073: [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode

2022-05-05 Thread GitBox


hudi-bot commented on PR #5073:
URL: https://github.com/apache/hudi/pull/5073#issuecomment-1119162871

   
   ## CI report:
   
   * 2e170509cffd77d4124ecbe337cd018c96a621fd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8423)
 
   * 7042acc09b38acd8741c89ca77e99bdacaa6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8459)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1119161967

   
   ## CI report:
   
   * c7324b8703ec68a4fd57195028d7977ab6862ac5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8458)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1119161930

   
   ## CI report:
   
   * 6f9f0539ab0102ff502e7985a453c4dddea6193a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8457)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5073: [HUDI-3675] Adding post write termination strategy to deltastreamer continuous mode

2022-05-05 Thread GitBox


hudi-bot commented on PR #5073:
URL: https://github.com/apache/hudi/pull/5073#issuecomment-1119161676

   
   ## CI report:
   
   * 2e170509cffd77d4124ecbe337cd018c96a621fd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8423)
 
   * 7042acc09b38acd8741c89ca77e99bdacaa6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3995) Bulk insert row writer perf improvements

2022-05-05 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3995:
--
Description: 
*EDIT*
**

While investigating perf hits in the Bulk Insert path, a few issues were found:
 # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
`getParititionKey` for `InternalRow`, leading to invocation of the default 
implementation, which converts the row to Avro.
 # HUDI-3993: Using a UDF to fetch record keys similarly has to deserialize 
`InternalRow` into a `Row`

 

  was:
*EDIT*
*-*

While investigating, perf hits in the Bulk Insert a few issues were found:
 # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
`getParititionKey` for `InternalRow`, leading to invocation of default 
implementation converting row to Avro.
 # HUDI-3993: Using UDF to fetch record keys, similarly has to deserialize 
`InternalRow` into deserialized `Row`

 


> Bulk insert row writer perf improvements
> 
>
> Key: HUDI-3995
> URL: https://issues.apache.org/jira/browse/HUDI-3995
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> *EDIT*
> **
> While investigating perf hits in the Bulk Insert path, a few issues were found:
>  # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
> `getParititionKey` for `InternalRow`, leading to invocation of the default 
> implementation, which converts the row to Avro.
>  # HUDI-3993: Using a UDF to fetch record keys similarly has to deserialize 
> `InternalRow` into a `Row`
>  
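As a rough illustration of item 1 (a simplified sketch, not Hudi's actual KeyGenerator API), the point of the fix is to read the key field straight from Spark's `InternalRow` instead of converting the whole row to Avro first:
{code:java}
import org.apache.spark.sql.catalyst.InternalRow;

// Simplified sketch: the key field's ordinal is assumed to be resolved once from
// the schema up front, so extracting the record key is a single direct read.
public class DirectRowKeyExtractor {
  private final int recordKeyOrdinal;

  public DirectRowKeyExtractor(int recordKeyOrdinal) {
    this.recordKeyOrdinal = recordKeyOrdinal;
  }

  // Fast path: no Avro conversion, no Row deserialization.
  public String getRecordKey(InternalRow row) {
    return row.getUTF8String(recordKeyOrdinal).toString();
  }
}
{code}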



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3995) Bulk insert row writer perf improvements

2022-05-05 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3995:
--
Description: 
*EDIT*
*-*

While investigating perf hits in the Bulk Insert path, a few issues were found:
 # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
`getParititionKey` for `InternalRow`, leading to invocation of the default 
implementation, which converts the row to Avro.
 # HUDI-3993: Using a UDF to fetch record keys similarly has to deserialize 
`InternalRow` into a `Row`

 

  was:
While investigating, perf hits in the Bulk Insert a few issues were found:
 # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
`getParititionKey` for `InternalRow`, leading to invocation of default 
implementation converting row to Avro.
 # HUDI-3993: Using UDF to fetch record keys, similarly has to deserialize 
`InternalRow` into deserialized `Row`

 


> Bulk insert row writer perf improvements
> 
>
> Key: HUDI-3995
> URL: https://issues.apache.org/jira/browse/HUDI-3995
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> *EDIT*
> *-*
> While investigating perf hits in the Bulk Insert path, a few issues were found:
>  # NonPartitionedKeyGenerator does not implement `getRecordKey`, 
> `getParititionKey` for `InternalRow`, leading to invocation of the default 
> implementation, which converts the row to Avro.
>  # HUDI-3993: Using a UDF to fetch record keys similarly has to deserialize 
> `InternalRow` into a `Row`
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4036) Investigate whether meta fields could be omitted completely

2022-05-05 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4036:
--
Epic Link: HUDI-3249

> Investigate whether meta fields could be omitted completely
> ---
>
> Key: HUDI-4036
> URL: https://issues.apache.org/jira/browse/HUDI-4036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.12.0
>
>
> Currently, even when meta fields are not populated, we still insert 
> empty-string columns to adhere to the expected schema.
> This has a non-trivial overhead of ~20% (relative to just writing the dataset as 
> is), since Spark has to essentially "re-write" the original row with the new 
> fields prepended.
> We should investigate whether it's feasible to avoid adding empty-string 
> columns completely if meta-fields are disabled.
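For readers wondering where the overhead comes from: even constant columns force Spark to build a new, wider projection of every row. A minimal sketch (only two of the meta columns shown for brevity):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.lit;

public class MetaFieldsOverheadDemo {
  // Adding constant meta columns still rewrites every row into a wider projection;
  // writing df as-is would skip this per-row work entirely.
  static Dataset<Row> withEmptyMetaFields(Dataset<Row> df) {
    return df
        .withColumn("_hoodie_commit_time", lit(""))
        .withColumn("_hoodie_record_key", lit(""));
  }
}
{code}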



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #4480: [HUDI-3123] consistent hashing index: basic write path (upsert/insert)

2022-05-05 Thread GitBox


alexeykudinkin commented on code in PR #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r866314882


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/BucketIdentifier.java:
##
@@ -22,41 +22,50 @@
 import org.apache.hudi.common.model.HoodieKey;
 import org.apache.hudi.common.model.HoodieRecord;
 
+import java.io.Serializable;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.List;
 import java.util.Map;
 import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 
-public class BucketIdentifier {
-  // compatible with the spark bucket name
+public class BucketIdentifier implements Serializable {
+  // Compatible with the spark bucket name
   private static final Pattern BUCKET_NAME = 
Pattern.compile(".*_(\\d+)(?:\\..*)?$");
 
   public static int getBucketId(HoodieRecord record, String indexKeyFields, 
int numBuckets) {
 return getBucketId(record.getKey(), indexKeyFields, numBuckets);
   }
 
   public static int getBucketId(HoodieKey hoodieKey, String indexKeyFields, 
int numBuckets) {
-return getBucketId(hoodieKey.getRecordKey(), indexKeyFields, numBuckets);
+return (getHashKeys(hoodieKey, indexKeyFields).hashCode() & 
Integer.MAX_VALUE) % numBuckets;
   }
 
   public static int getBucketId(String recordKey, String indexKeyFields, int 
numBuckets) {
-List<String> hashKeyFields;
+return (getHashKeys(recordKey, indexKeyFields).hashCode() & 
Integer.MAX_VALUE) % numBuckets;
+  }
+
+  public static List<String> getHashKeys(HoodieKey hoodieKey, String 
indexKeyFields) {
+return getHashKeys(hoodieKey.getRecordKey(), indexKeyFields);
+  }
+
+  protected static List<String> getHashKeys(String recordKey, String 
indexKeyFields) {
+List<String> hashKeys;
 if (!recordKey.contains(":")) {
-  hashKeyFields = Collections.singletonList(recordKey);
+  hashKeys = Collections.singletonList(recordKey);
 } else {
   Map<String, String> recordKeyPairs = Arrays.stream(recordKey.split(","))
   .map(p -> p.split(":"))
   .collect(Collectors.toMap(p -> p[0], p -> p[1]));
-  hashKeyFields = Arrays.stream(indexKeyFields.split(","))
+  hashKeys = Arrays.stream(indexKeyFields.split(","))
   .map(f -> recordKeyPairs.get(f))
   .collect(Collectors.toList());
 }
-return (hashKeyFields.hashCode() & Integer.MAX_VALUE) % numBuckets;
+return hashKeys;
   }
 
-  // only for test
+  // Only for test

Review Comment:
   Why do we need a method that is used only for tests?
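   For anyone following the thread, a self-contained sketch of what the refactored methods compute (the record key and index field in the example are made up for illustration; this is not the Hudi class itself):
```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketIdSketch {
  // Extract the hash key fields from a composite "field:value,field:value" record key.
  static List<String> hashKeys(String recordKey, String indexKeyFields) {
    if (!recordKey.contains(":")) {
      return Collections.singletonList(recordKey);
    }
    Map<String, String> pairs = Arrays.stream(recordKey.split(","))
        .map(p -> p.split(":"))
        .collect(Collectors.toMap(p -> p[0], p -> p[1]));
    return Arrays.stream(indexKeyFields.split(","))
        .map(pairs::get)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // e.g. record key "id:42,name:foo", index key field "id", 8 buckets
    int bucket = (hashKeys("id:42,name:foo", "id").hashCode() & Integer.MAX_VALUE) % 8;
    System.out.println(bucket);
  }
}
```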



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/ConsistentBucketIdentifier.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bucket;
+
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.ConsistentHashingNode;
+import org.apache.hudi.common.model.HoodieConsistentHashingMetadata;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.hash.HashID;
+
+import java.util.Collection;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.SortedMap;
+import java.util.TreeMap;
+
+public class ConsistentBucketIdentifier extends BucketIdentifier {
+
+  /**
+   * Hashing metadata of a partition
+   */
+  private final HoodieConsistentHashingMetadata metadata;
+  /**
+   * In-memory structure to speed up ring mapping (hashing value -> hashing 
node)
+   */
+  private final TreeMap ring;
+  /**
+   * Mapping from fileId -> hashing node
+   */
+  private final Map fileIdToBucket;
+
+  public ConsistentBucketIdentifier(HoodieConsistentHashingMetadata metadata) {
+this.metadata = metadata;
+this.fileIdToBucket = new HashMap<>();
+this.ring = new TreeMap<>();
+initialize();
+  }
+
+  public Collection getNodes() {
+return ring.values();
+  }
+
+  public HoodieConsistentHashingMetadata getMetadata() {
+return metadata;
+  }
+
+  public int getNumBuckets() {
+return getNodes().size();

Review Comment:
   Can do `ring.size()` directly to avoid additional allocations
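   For readers of the thread, a rough sketch of the ring lookup such a class maintains (a generic consistent-hashing lookup on a `TreeMap`; the types, field names, and non-empty-ring assumption are mine, not the actual Hudi implementation):
```java
import java.util.Map;
import java.util.TreeMap;

public class RingLookupSketch {
  // Nodes sit on an integer ring, keyed by their position.
  private final TreeMap<Integer, String> ring = new TreeMap<>();

  void addNode(int position, String fileId) {
    ring.put(position, fileId);
  }

  // A key's hash maps to the first node at or after it, wrapping around to the
  // lowest entry when the hash is past the last node (ring assumed non-empty).
  String nodeFor(int hash) {
    Map.Entry<Integer, String> e = ring.ceilingEntry(hash);
    return (e != null ? e : ring.firstEntry()).getValue();
  }
}
```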



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBuck

[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1119147087

   
   ## CI report:
   
   * 2acc8007cc153d7d4a228e126ef706e5bb25cfbb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8454)
 
   * c7324b8703ec68a4fd57195028d7977ab6862ac5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8458)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1119145841

   
   ## CI report:
   
   * 2acc8007cc153d7d4a228e126ef706e5bb25cfbb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8454)
 
   * c7324b8703ec68a4fd57195028d7977ab6862ac5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4018:
--
Reviewers: Raymond Xu

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Prepare a minimal set of 5 to 8 yamls that can be used against any write mode, 
> any query engine, and any table type. 
>  
> For example:
> let's say we come up with 6 yamls covering all cases. 
> The same set should work for all possible combinations from the categories below. 
>  
> Table type: 
> COW/MOR
> Metadata:
> enable/disable
> Dataset type:
> partitioned/non-partitioned
> Write mode:
> delta streamer, spark datasource, spark sql, spark streaming sink
>  
> Query engine: 
> spark datasource, hive, presto, trino
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4018:
--
Reviewers: Raymond Xu  (was: Raymond Xu)

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Prepare a minimal set of 5 to 8 yamls that can be used against any write mode, 
> any query engine, and any table type. 
>  
> For example:
> let's say we come up with 6 yamls covering all cases. 
> The same set should work for all possible combinations from the categories below. 
>  
> Table type: 
> COW/MOR
> Metadata:
> enable/disable
> Dataset type:
> partitioned/non-partitioned
> Write mode:
> delta streamer, spark datasource, spark sql, spark streaming sink
>  
> Query engine: 
> spark datasource, hive, presto, trino
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3873) 0.11 release blog

2022-05-05 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3873:

Status: Patch Available  (was: In Progress)

> 0.11 release blog
> -
>
> Key: HUDI-3873
> URL: https://issues.apache.org/jira/browse/HUDI-3873
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-05-05 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532554#comment-17532554
 ] 

sivabalan narayanan commented on HUDI-4018:
---

# simple sanity (1 insert, 1 upsert, 1 update (spark-sql), 1 delete, then validate): 
to assist in catching any bundling issues or basic regressions. 
 # non-core write operations: insert overwrite table, insert overwrite, delete 
partitions.
 # testing immutable data: pure bulk_inserts and pure inserts.
 # long-running tests (multiple batches of inserts, upserts, deletes and 
validation), at least 100 commits.
 ## we will make cleaner and archival configs aggressive (<10) so that those 
get exercised often during these tests.
 # savepoint and restore tests.

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Prepare a minimal set of 5 to 8 yamls that can be used against any write mode, 
> any query engine, and any table type. 
>  
> For example:
> let's say we come up with 6 yamls covering all cases. 
> The same set should work for all possible combinations from the categories below. 
>  
> Table type: 
> COW/MOR
> Metadata:
> enable/disable
> Dataset type:
> partitioned/non-partitioned
> Write mode:
> delta streamer, spark datasource, spark sql, spark streaming sink
>  
> Query engine: 
> spark datasource, hive, presto, trino
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1119111963

   
   ## CI report:
   
   * 8b22298c933375b9af687093cecc68603d7e3c3d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8451)
 
   * 6f9f0539ab0102ff502e7985a453c4dddea6193a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8457)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1119110521

   
   ## CI report:
   
   * 8b22298c933375b9af687093cecc68603d7e3c3d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8451)
 
   * 6f9f0539ab0102ff502e7985a453c4dddea6193a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] vicuna96 commented on issue #4700: [SUPPORT] Adding new column to table is not propagated to Hive via HMS sync mode

2022-05-05 Thread GitBox


vicuna96 commented on issue #4700:
URL: https://github.com/apache/hudi/issues/4700#issuecomment-1119074796

   Hi @xiarixiaoyao, @nsivabalan, is there any known workaround for this? It 
does seem like the problem is that it's trying to use the implementation from 
org.spark-project.hive:hive-metastore:1.2.1.spark2, which is included in the Spark 
2.4.7 environment under $SPARK_HOME/jars (and hence in the dataproc image 
mentioned above). Do you have any advice on how to shade those packages? It 
would be very helpful.
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5393: [MINOR] follow up HUDI-3921, address all comments

2022-05-05 Thread GitBox


alexeykudinkin commented on code in PR #5393:
URL: https://github.com/apache/hudi/pull/5393#discussion_r866304797


##
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##
@@ -840,13 +840,9 @@ private static Object rewriteRecordWithNewSchema(Object 
oldRecord, Schema oldSch
   }
 
   private static String createFullName(Deque<String> fieldNames) {
-String result = "";
-if (!fieldNames.isEmpty()) {
-  List<String> parentNames = new ArrayList<>();
-  fieldNames.descendingIterator().forEachRemaining(parentNames::add);
-  result = parentNames.stream().collect(Collectors.joining("."));
-}
-return result;
+return StreamSupport

Review Comment:
   No need for Stream; you can just do `String.join`.
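   A small sketch of that suggestion, assuming `fieldNames` is a `Deque<String>` used as a stack as in the surrounding code (so `descendingIterator` yields parent-first order):
```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FullNameJoin {
  // String.join accepts any Iterable; a method reference to descendingIterator
  // turns the deque into one without an intermediate list or stream.
  static String createFullName(Deque<String> fieldNames) {
    return String.join(".", (Iterable<String>) fieldNames::descendingIterator);
  }

  public static void main(String[] args) {
    Deque<String> names = new ArrayDeque<>();
    names.push("parent");
    names.push("child");
    System.out.println(createFullName(names)); // prints "parent.child"
  }
}
```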



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (d794f4fbf9 -> abb4893b25)

2022-05-05 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from d794f4fbf9 [MINOR] Optimize code logic (#5499)
 add abb4893b25 [HUDI-2875] Make HoodieParquetWriter Thread safe and memory 
executor exit gracefully (#4264)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/io/HoodieConcatHandle.java |  3 ++
 .../org/apache/hudi/io/HoodieCreateHandle.java |  3 ++
 .../java/org/apache/hudi/io/HoodieMergeHandle.java |  3 ++
 .../apache/hudi/io/HoodieSortedMergeHandle.java|  3 ++
 .../hudi/io/HoodieUnboundedCreateHandle.java   |  3 ++
 .../hudi/io/storage/HoodieParquetWriter.java   | 10 +
 .../table/action/commit/HoodieMergeHelper.java |  5 ++-
 .../hudi/execution/FlinkLazyInsertIterable.java|  1 +
 .../hudi/table/action/commit/FlinkMergeHelper.java |  5 ++-
 .../hudi/execution/JavaLazyInsertIterable.java |  1 +
 .../hudi/table/action/commit/JavaMergeHelper.java  |  5 ++-
 .../hudi/execution/SparkLazyInsertIterable.java|  1 +
 .../bootstrap/OrcBootstrapMetadataHandler.java |  3 +-
 .../bootstrap/ParquetBootstrapMetadataHandler.java |  8 ++--
 .../TestBoundedInMemoryExecutorInSpark.java| 45 ++
 .../common/util/queue/BoundedInMemoryExecutor.java | 24 +++-
 .../common/testutils/HoodieTestDataGenerator.java  | 10 +++--
 17 files changed, 121 insertions(+), 12 deletions(-)



[GitHub] [hudi] yihua merged pull request #4264: [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor …

2022-05-05 Thread GitBox


yihua merged PR #4264:
URL: https://github.com/apache/hudi/pull/4264


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5269: [HUDI-3636] Create new write clients for async table services in DeltaStreamer

2022-05-05 Thread GitBox


alexeykudinkin commented on code in PR #5269:
URL: https://github.com/apache/hudi/pull/5269#discussion_r866299432


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseCompactor.java:
##
@@ -31,16 +33,30 @@
 
   private static final long serialVersionUID = 1L;
 
+  protected final transient Object writeClientUpdateLock = new Object();
+  protected final transient List> 
oldCompactionClientList = new ArrayList<>();
+
   protected transient BaseHoodieWriteClient compactionClient;
 
+  protected boolean isCompactionRunning = false;
+
   public BaseCompactor(BaseHoodieWriteClient compactionClient) {
 this.compactionClient = compactionClient;
   }
 
   public abstract void compact(HoodieInstant instant) throws IOException;
 
   public void updateWriteClient(BaseHoodieWriteClient writeClient) 
{
-this.compactionClient = writeClient;
+synchronized (writeClientUpdateLock) {
+  if (!isCompactionRunning) {
+this.compactionClient.close();
+  } else {
+// Store the old compaction client so that they can be closed

Review Comment:
   Agree very strongly with the point above: unless there is a very strong 
argument why we cannot re-init the Async Service itself, I believe we should 
follow through with an invariant that the AS and Write Client lifecycles are 
coupled, and solve these concurrency control issues at the root.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseCompactor.java:
##
@@ -31,16 +33,30 @@
 
   private static final long serialVersionUID = 1L;
 
+  protected final transient Object writeClientUpdateLock = new Object();
+  protected final transient List> 
oldCompactionClientList = new ArrayList<>();
+
   protected transient BaseHoodieWriteClient compactionClient;
 
+  protected boolean isCompactionRunning = false;
+
   public BaseCompactor(BaseHoodieWriteClient compactionClient) {
 this.compactionClient = compactionClient;
   }
 
   public abstract void compact(HoodieInstant instant) throws IOException;
 
   public void updateWriteClient(BaseHoodieWriteClient writeClient) 
{
-this.compactionClient = writeClient;
+synchronized (writeClientUpdateLock) {
+  if (!isCompactionRunning) {
+this.compactionClient.close();
+  } else {
+// Store the old compaction client so that they can be closed

Review Comment:
   Immutability is a hard but very powerful property; we should lean on it as 
much as possible and only detour from it when there's no other choice (mostly for 
perf reasons).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #4264: [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor …

2022-05-05 Thread GitBox


alexeykudinkin commented on PR #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-1119021233

   LGTM, @nsivabalan @yihua can you please help land that one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1118874858

   
   ## CI report:
   
   * 2acc8007cc153d7d4a228e126ef706e5bb25cfbb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8454)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1118859963

   
   ## CI report:
   
   * 8b22298c933375b9af687093cecc68603d7e3c3d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5473: [HUDI-4003] Try to read all the log file to parse schema

2022-05-05 Thread GitBox


alexeykudinkin commented on code in PR #5473:
URL: https://github.com/apache/hudi/pull/5473#discussion_r866120175


##
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java:
##
@@ -109,13 +110,18 @@ private MessageType getTableParquetSchemaFromDataFile() {
   // Determine the file format based on the file name, and then 
extract schema from it.
   if (instantAndCommitMetadata.isPresent()) {
 HoodieCommitMetadata commitMetadata = 
instantAndCommitMetadata.get().getRight();
-String filePath = 
commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
-if 
(filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
-  // this is a log file
-  return readSchemaFromLogFile(new Path(filePath));
-} else {
-  return readSchemaFromBaseFile(filePath);
+Iterator filePaths = 
commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
+MessageType type = null;
+while (filePaths.hasNext() && type == null) {

Review Comment:
   Understood. But behavior should be consistent regardless of whether this is 
COW or MOR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-05-05 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1118817192

   
   ## CI report:
   
   * 0e3f7e76cfdf18b34aafabf2a7949f6b1e62bddc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #5287: [HUDI-3849] AvroDeserializer supports AVRO_REBASE_MODE_IN_READ configuration

2022-05-05 Thread GitBox


alexeykudinkin commented on PR #5287:
URL: https://github.com/apache/hudi/pull/5287#issuecomment-1118812053

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-05-05 Thread GitBox


alexeykudinkin commented on code in PR #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r866116760


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1301,4 +1359,33 @@ public void close() {
 this.heartbeatClient.stop();
 this.txnManager.close();
   }
+
+  private void setWriteTimer(HoodieTable table) {
+String commitType = table.getMetaClient().getCommitActionType();
+if (commitType.equals(HoodieTimeline.COMMIT_ACTION)) {
+  writeTimer = metrics.getCommitCtx();
+} else if (commitType.equals(HoodieTimeline.DELTA_COMMIT_ACTION)) {
+  writeTimer = metrics.getDeltaCommitCtx();
+}
+  }
+
+  private void tryUpgrade(HoodieTableMetaClient metaClient, Option 
instantTime) {
+UpgradeDowngrade upgradeDowngrade =
+new UpgradeDowngrade(metaClient, config, context, 
upgradeDowngradeHelper);
+
+if 
(upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {
+  // Ensure no inflight commits by setting EAGER policy and explicitly 
cleaning all failed commits
+  List instantsToRollback = getInstantsToRollback(metaClient, 
HoodieFailedWritesCleaningPolicy.EAGER, instantTime);
+
+  Map> pendingRollbacks = 
getPendingRollbackInfos(metaClient);

Review Comment:
   The code was migrated as is, so I can't speak from historical context, but 
my hunch is that we do that to make sure the table is in a consistent state (no 
leftovers of failed commits) when we start the upgrade process.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1118804996

   
   ## CI report:
   
   * 2acc8007cc153d7d4a228e126ef706e5bb25cfbb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8454)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4017) Spark sql tests as part of github actions for diff spark versions

2022-05-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4017:
-
Labels: pull-request-available  (was: )

> Spark sql tests as part of github actions for diff spark versions
> -
>
> Key: HUDI-4017
> URL: https://issues.apache.org/jira/browse/HUDI-4017
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5512: [HUDI-4017] Improve spark sql coverage

2022-05-05 Thread GitBox


hudi-bot commented on PR #5512:
URL: https://github.com/apache/hudi/pull/5512#issuecomment-1118796395

   
   ## CI report:
   
   * 2acc8007cc153d7d4a228e126ef706e5bb25cfbb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-05-05 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1118784042

   
   ## CI report:
   
   * c9ee1edc0285fb17a9455cc5ca52072854d66a91 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7921)
 
   * 0e3f7e76cfdf18b34aafabf2a7949f6b1e62bddc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8453)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua merged pull request #5499: [MINOR] Optimize code logic

2022-05-05 Thread GitBox


yihua merged PR #5499:
URL: https://github.com/apache/hudi/pull/5499


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Optimize code logic (#5499)

2022-05-05 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d794f4fbf9 [MINOR] Optimize code logic (#5499)
d794f4fbf9 is described below

commit d794f4fbf9d8ce4c90507e4de36121ac1fc2fd4b
Author: qianchutao <72595723+qianchu...@users.noreply.github.com>
AuthorDate: Fri May 6 00:33:06 2022 +0800

[MINOR] Optimize code logic (#5499)
---
 .../org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
index 56124b82af..824c7375fa 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
@@ -625,8 +625,8 @@ public class HoodieDeltaStreamer implements Serializable {
 
ValidationUtils.checkArgument(baseFileFormat.equals(cfg.baseFileFormat) || 
cfg.baseFileFormat == null,
 "Hoodie table's base file format is of type " + baseFileFormat + " 
but passed in CLI argument is "
 + cfg.baseFileFormat);
-cfg.baseFileFormat = 
meta.getTableConfig().getBaseFileFormat().toString();
-this.cfg.baseFileFormat = cfg.baseFileFormat;
+cfg.baseFileFormat = baseFileFormat;
+this.cfg.baseFileFormat = baseFileFormat;
   } else {
 tableType = HoodieTableType.valueOf(cfg.tableType);
 if (cfg.baseFileFormat == null) {



[GitHub] [hudi] hudi-bot commented on pull request #4676: [HUDI-3304] support partial update on mor table

2022-05-05 Thread GitBox


hudi-bot commented on PR #4676:
URL: https://github.com/apache/hudi/pull/4676#issuecomment-1118775741

   
   ## CI report:
   
   * c9ee1edc0285fb17a9455cc5ca52072854d66a91 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=7921)
 
   * 0e3f7e76cfdf18b34aafabf2a7949f6b1e62bddc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4047) hoodie.avro.schema.validate error message refact

2022-05-05 Thread Istvan Darvas (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Darvas updated HUDI-4047:

Description: 
Hi Guys!

 

I have just used the schema validation and it works like a charm, but :)

 

A few things would be very useful

1.) after the error message "Failed schema compatibility check for ...", the FULL 
JSON-compatible payload should come

2.) the JSON payload should contain "violations": [\\{item}, \\{item}] 

  if going over all the violations is complex or not performant, then just print 
the first one

   "violation": \{"field_name": field_name,  "writerSchema": writer_schema, 
"tableSchema": table_schema }

 

Why? ;) - if someone has a table with a lot of cols/features it would be easier to 
find the discrepancy.

So the debug process would be to copy the FULL JSON into a JSON editor and check 
the nodes... 

this would speed up the debugging for me ;) but maybe I am not alone in this.


 

Thanks,

 

I got an exception like this for example and I would like a nicer one like I 
explained above:

Caused by: org.apache.hudi.exception.HoodieException: Failed schema 
compatibility check for writerSchema 
:{"type":"record","name":"iot_raw_{_}ingress_pkg_decoded_rep_record","namespace":"hoodie.iot_raw{_}_ingress_pkg_decoded_rep","fields":[

{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null}

,{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"correlation_id","type":["null","string"],"default":null},{"name":"iot_pkg_receive_time","type":["null",

{"type":"long","logicalType":"timestamp-micros"}

],"default":null},{"name":"parsing_time","type":["null",

{"type":"long","logicalType":"timestamp-micros"}

],"default":null},{"name":"receive_time","type":["null",

{"type":"long","logicalType":"timestamp-micros"}

],"default":null},{"name":"aggregate_id","type":["null","string"],"default":null},{"name":"message_id","type":["null","int"],"default":null},{"name":"message_type_name","type":["null","string"],"default":null},{"name":"message_type_id","type":["null","int"],"default":null},{"name":"report_id","type":["null","int"],"default":null},{"name":"report_type_name","type":["null","string"],"default":null},{"name":"report_type_id","type":["null","int"],"default":null},{"name":"report_dcd_payload","type":["null","string"],"default":null}]},
 table schema 
:{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields":[

{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null}

,{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"correlation_id","type":"string"},\{"name":"iot_pkg_receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"parsing_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"aggregate_id","type":"string"},\{"name":"message_id","type":"int"},\{"name":"message_type_name","type":"string"},\{"name":"message_type_id","type":"int"},\{"name":"report_id","type":"int"},\{"name":"report_type_name","type":"string"},\{"name":"report_type_id","type":"int"},\{"name":"report_dcd_payload","type":"string"},{"name":"year","type":["null","int"],"default":null},{"name":"month","type":["null","int"],"default":null},{"name":"day","type":["null","int"],"default":null}]},
 base path :s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep
        at 
org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:682)
        at 
org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:688)
        ... 42 more

 

  was:
Hi Guys!

 

I have just used the schema validation and works as a charm, but :)

 

A few things would be very usefull

1.) after the error message: Failed schema compatibility check for \{FULL JSON 
Compatible payload should come}

2.) in the JSON payload should contain "violations": [\{item}, \{item}] 

  if go over all the violations is complex, or not performant then just print 
the first oine

   "violation": \{"fiel_name": fied_name,  "writerSchema": "writer_schema, 
"tableSchema": table_schema }

 

Why? ;) - if someone has a table with a lot of cols/feature would be easier to 
find the discrepenacy.

So the debug process would be copy the FULL JSON into a json editor, and check 
the nodes... 

this would speed up the the debug for me ;) but may

[jira] [Updated] (HUDI-4047) hoodie.avro.schema.validate error message refact

2022-05-05 Thread Istvan Darvas (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Darvas updated HUDI-4047:

Priority: Minor  (was: Major)

> hoodie.avro.schema.validate error message refact
> 
>
> Key: HUDI-4047
> URL: https://issues.apache.org/jira/browse/HUDI-4047
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Istvan Darvas
>Priority: Minor
>
> Hi Guys!
>  
> I have just used the schema validation and it works like a charm, but :)
>  
> A few things would be very useful
> 1.) after the error message "Failed schema compatibility check for ...", the FULL 
> JSON-compatible payload should come
> 2.) the JSON payload should contain "violations": [\{item}, \{item}] 
>   if going over all the violations is complex or not performant, then just print 
> the first one
>    "violation": \{"field_name": field_name,  "writerSchema": writer_schema, 
> "tableSchema": table_schema }
>  
> Why? ;) - if someone has a table with a lot of cols/features it would be easier 
> to find the discrepancy.
> So the debug process would be to copy the FULL JSON into a JSON editor and 
> check the nodes... 
> this would speed up the debugging for me ;) but maybe I am not alone in 
> this.
>  
> Thanks,
>  
> I got an exception like this for example and I would like a nicer one like I 
> explained above:
> Caused by: org.apache.hudi.exception.HoodieException: Failed schema 
> compatibility check for writerSchema 
> :\{"type":"record","name":"iot_raw__ingress_pkg_decoded_rep_record","namespace":"hoodie.iot_raw__ingress_pkg_decoded_rep","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"correlation_id","type":["null","string"],"default":null},\{"name":"iot_pkg_receive_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"parsing_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"receive_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"aggregate_id","type":["null","string"],"default":null},\{"name":"message_id","type":["null","int"],"default":null},\{"name":"message_type_name","type":["null","string"],"default":null},\{"name":"message_type_id","type":["null","int"],"default":null},\{"name":"report_id","type":["null","int"],"default":null},\{"name":"report_type_name","type":["null","string"],"default":null},\{"name":"report_type_id","type":["null","int"],"default":null},\{"name":"report_dcd_payload","type":["null","string"],"default":null}]},
>  table schema 
> :\{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"correlation_id","type":"string"},\{"name":"iot_pkg_receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"parsing_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"aggregate_id","type":"string"},\{"name":"message_id","type":"int"},\{"name":"message_type_name","type":"string"},\{"name":"message_type_id","type":"int"},\{"name":"report_id","type":"int"},\{"name":"report_type_name","type":"string"},\{"name":"report_type_id","type":"int"},\{"name":"report_dcd_payload","type":"string"},\{"name":"year","type":["null","int"],"default":null},\{"name":"month","type":["null","int"],"default":null},\{"name":"day","type":["null","int"],"default":null}]},
>  base path :s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep
>         at 
> org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:682)
>         at 
> org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:688)
>         ... 42 more
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4047) hoodie.avro.schema.validate error message refact

2022-05-05 Thread Istvan Darvas (Jira)
Istvan Darvas created HUDI-4047:
---

 Summary: hoodie.avro.schema.validate error message refact
 Key: HUDI-4047
 URL: https://issues.apache.org/jira/browse/HUDI-4047
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Istvan Darvas


Hi Guys!

 

I have just used the schema validation and it works like a charm, but :)

 

A few things would be very useful:

1.) after the error message "Failed schema compatibility check", the FULL 
JSON-compatible payload should come

2.) the JSON payload should contain "violations": [\{item}, \{item}] 

  if going over all the violations is complex or not performant, then just print 
the first one

   "violation": \{"field_name": field_name,  "writerSchema": writer_schema, 
"tableSchema": table_schema }

 

Why? ;) - if someone has a table with a lot of cols/fields, it would be easier to 
find the discrepancy.

So the debug process would be to copy the FULL JSON into a JSON editor and check 
the nodes... 

this would speed up the debug for me ;) but maybe I am not alone with this.

 

Thanks,

 

I got an exception like this for example and I would like a nicer one like I 
explained above:

Caused by: org.apache.hudi.exception.HoodieException: Failed schema 
compatibility check for writerSchema 
:\{"type":"record","name":"iot_raw__ingress_pkg_decoded_rep_record","namespace":"hoodie.iot_raw__ingress_pkg_decoded_rep","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"correlation_id","type":["null","string"],"default":null},\{"name":"iot_pkg_receive_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"parsing_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"receive_time","type":["null",{"type":"long","logicalType":"timestamp-micros"}],"default":null},\{"name":"aggregate_id","type":["null","string"],"default":null},\{"name":"message_id","type":["null","int"],"default":null},\{"name":"message_type_name","type":["null","string"],"default":null},\{"name":"message_type_id","type":["null","int"],"default":null},\{"name":"report_id","type":["null","int"],"default":null},\{"name":"report_type_name","type":["null","string"],"default":null},\{"name":"report_type_id","type":["null","int"],"default":null},\{"name":"report_dcd_payload","type":["null","string"],"default":null}]},
 table schema 
:\{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},\{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},\{"name":"correlation_id","type":"string"},\{"name":"iot_pkg_receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"parsing_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"receive_time","type":{"type":"long","logicalType":"timestamp-micros"}},\{"name":"aggregate_id","type":"string"},\{"name":"message_id","type":"int"},\{"name":"message_type_name","type":"string"},\{"name":"message_type_id","type":"int"},\{"name":"report_id","type":"int"},\{"name":"report_type_name","type":"string"},\{"name":"report_type_id","type":"int"},\{"name":"report_dcd_payload","type":"string"},\{"name":"year","type":["null","int"],"default":null},\{"name":"month","type":["null","int"],"default":null},\{"name":"day","type":["null","int"],"default":null}]},
 base path :s3://scgps-datalake/iot_raw/ingress_pkg_decoded_rep
        at 
org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:682)
        at 
org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:688)
        ... 42 more

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] fengjian428 commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table

2022-05-05 Thread GitBox


fengjian428 commented on code in PR #4676:
URL: https://github.com/apache/hudi/pull/4676#discussion_r866079366


##
hudi-common/src/test/java/org/apache/hudi/common/model/TestPartialUpdateAvroPayload.java:
##
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.JsonProperties;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+/**
+ * Unit tests {@link TestPartialUpdateAvroPayload}.
+ */
+public class TestPartialUpdateAvroPayload {

Review Comment:
   @codope 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] fengjian428 commented on a diff in pull request #4676: [HUDI-3304] support partial update on mor table

2022-05-05 Thread GitBox


fengjian428 commented on code in PR #4676:
URL: https://github.com/apache/hudi/pull/4676#discussion_r866078762


##
hudi-common/src/test/java/org/apache/hudi/common/model/TestPartialUpdateAvroPayload.java:
##
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.avro.JsonProperties;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+/**
+ * Unit tests {@link TestPartialUpdateAvroPayload}.
+ */
+public class TestPartialUpdateAvroPayload {

Review Comment:
   should this integ-test cover hive and trino?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5510: [minor] fix the flacky test ITTestHoodieDataSource#testStreamWriteBat…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5510:
URL: https://github.com/apache/hudi/pull/5510#issuecomment-1118719037

   
   ## CI report:
   
   * ec40e1e3c0495c301a964a1fa1a740dc6a1d0e00 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8450)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5505: [HUDI-3687] Enable spark32 in GH actions

2022-05-05 Thread GitBox


hudi-bot commented on PR #5505:
URL: https://github.com/apache/hudi/pull/5505#issuecomment-1118698058

   
   ## CI report:
   
   * ad175e8b93bc54e10a846cbcb8caad988ce8280b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8452)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5505: [HUDI-3687] Enable spark32 in GH actions

2022-05-05 Thread GitBox


hudi-bot commented on PR #5505:
URL: https://github.com/apache/hudi/pull/5505#issuecomment-1118651795

   
   ## CI report:
   
   * 28c50bbaacaa1b70811f02c3a1e02138dfd09e15 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8447)
 
   * ad175e8b93bc54e10a846cbcb8caad988ce8280b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8452)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] VitoMakarevich opened a new issue, #5511: [SUPPORT] Inremental query from the beginning of time

2022-05-05 Thread GitBox


VitoMakarevich opened a new issue, #5511:
URL: https://github.com/apache/hudi/issues/5511

   **Describe the problem you faced**
   
   An incremental query with `begin.instanttime` less than the first commit time 
returns different results depending on how many commits have been added.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Write n commits to the Hudi table, where n > the number-of-commits config param.
   2. Trigger an incremental query with `begin.instanttime` less than the first 
commit, e.g. `0`.
   3. Verify the number of output rows and compare it with the number of rows 
in the snapshot. The incremental result will contain fewer rows than the snapshot.
   Here is the [sample 
repo](https://github.com/VitoMakarevich/hudi-incremental-issue) with 
reproduction.
   
   But if you do the same thing with n commits, where n < the number-of-commits 
config param, and then query from `0`, you will see a row count equal to the 
number of rows in the snapshot.
   
   **Expected behavior**
   
   I expect the incremental behavior with `begin.instanttime` less than the 
first commit to be the same regardless of whether anything has been cleaned.
   
   **Environment Description**
   
   * Hudi version : 0.9.0, 0.10.0, works correctly for 0.11.0
   
   * Spark version : 3.1.2
   
   * Storage (HDFS/S3/GCS..) : s3/local file
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I assume that this [PR](https://github.com/apache/hudi/pull/3946/files) 
fixes the behavior for `0.11.0`.
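   
   For reference, a hedged sketch of the comparison described in the reproduction 
steps; the base path, Spark session setup, and table contents are assumptions, 
and the matching hudi-spark bundle is expected on the classpath:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-incremental-from-zero")
        .master("local[*]")
        .getOrCreate();

    String basePath = "file:///tmp/hudi_table"; // hypothetical table location

    // Incremental query "from the beginning of time": begin instant time "0"
    Dataset<Row> incremental = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "0")
        .load(basePath);

    // Snapshot query of the same table for comparison
    Dataset<Row> snapshot = spark.read().format("hudi").load(basePath);

    // On the affected versions (0.9.0/0.10.0), the incremental count can be lower
    // than the snapshot count once enough commits have been written for cleaning
    // or archival to kick in.
    System.out.println("incremental rows: " + incremental.count());
    System.out.println("snapshot rows:    " + snapshot.count());
  }
}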
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1118638660

   
   ## CI report:
   
   * 2d627024cd13ca2389008a649b3defd9fba3b04c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8435)
 
   * 8b22298c933375b9af687093cecc68603d7e3c3d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8451)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5509: [HUDI-4041] compact with precombineKey in RealtimeCompactedRecordRead…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5509:
URL: https://github.com/apache/hudi/pull/5509#issuecomment-1118634615

   
   ## CI report:
   
   * 99e207f779a258ff32c57c9d1b962d772213c081 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8448)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-05 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1118634531

   
   ## CI report:
   
   * 2d627024cd13ca2389008a649b3defd9fba3b04c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8435)
 
   * 8b22298c933375b9af687093cecc68603d7e3c3d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5510: [minor] fix the flacky test ITTestHoodieDataSource#testStreamWriteBat…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5510:
URL: https://github.com/apache/hudi/pull/5510#issuecomment-1118629985

   
   ## CI report:
   
   * ec40e1e3c0495c301a964a1fa1a740dc6a1d0e00 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8450)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3957) Evaluate Support for spark2 and scala12

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3957:
--
Status: Patch Available  (was: In Progress)

> Evaluate Support for spark2 and scala12 
> 
>
> Key: HUDI-3957
> URL: https://issues.apache.org/jira/browse/HUDI-3957
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-05-05 at 8.51.11 AM.png, Screen Shot 
> 2022-05-05 at 8.53.39 AM.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We may need to evaluate the need for supporting spark2 and scala 12 and 
> deprecate if there is not much usage. 
>  
> From the overall stats, hudi-spark_2.12 bundle usage is 2%. Among 
> hudi_spark2.12 bundle usages, most are from 0.7, 0.8, and 0.9; 0.10 
> and above are ~5% of all usages of the hudi-spark2.12 bundle. So we can 
> probably deprecate spark2 and scala12 going forward and ask users to 
> use spark3. 
>  
> !Screen Shot 2022-05-05 at 8.51.11 AM.png!
>  
>  
> !Screen Shot 2022-05-05 at 8.53.39 AM.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4018:
--
Status: Patch Available  (was: In Progress)

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Prepare 5 to 8 minimal set of yamls that can be used against any write mode 
> and against any query engine and table types. 
>  
> For eg:
> lets say we come up with 6 yamls covering all cases. 
> Same set should work for all possible combinations from below categories. 
>  
> Table type: 
> COW/MOR
> Metadata:
> enable/disable
> Dataset type:
> partitioned/non-partitioned
> Write mode:
> delta streamer, spark datasource, spark sql, spark streaming sink
>  
> Query engine: 
> spark datasource, hive, presto, trino
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4027) add support to test non-core write operations (insert overwrite, delete partitions) to integ test framework

2022-05-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4027:
--
Status: Patch Available  (was: In Progress)

> add support to test non-core write operations (insert overwrite, delete 
> partitions) to integ test framework
> ---
>
> Key: HUDI-4027
> URL: https://issues.apache.org/jira/browse/HUDI-4027
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>
> we need support for testing non-core operations. 
> insert overwrite
> insert overwrite table
> delete partitions
>  
> spark-datasource writes
> spark-sql writes. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5510: [minor] fix the flacky test ITTestHoodieDataSource#testStreamWriteBat…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5510:
URL: https://github.com/apache/hudi/pull/5510#issuecomment-1118625639

   
   ## CI report:
   
   * ec40e1e3c0495c301a964a1fa1a740dc6a1d0e00 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5505: [HUDI-3687] Enable spark32 in GH actions

2022-05-05 Thread GitBox


hudi-bot commented on PR #5505:
URL: https://github.com/apache/hudi/pull/5505#issuecomment-1118625554

   
   ## CI report:
   
   * 28c50bbaacaa1b70811f02c3a1e02138dfd09e15 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8447)
 
   * e6575ef2e0082baa6608f8c3f75a4ec6d1312662 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8449)
 
   * ad175e8b93bc54e10a846cbcb8caad988ce8280b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] andykrk commented on issue #4604: [SUPPORT] Archive functionality fails

2022-05-05 Thread GitBox


andykrk commented on issue #4604:
URL: https://github.com/apache/hudi/issues/4604#issuecomment-1118619225

   @nsivabalan We need to park this item temporarily. We may get some 
additional resources to work on that on our side after this. I will keep you 
posted on that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 opened a new pull request, #5510: [minor] fix the flacky test ITTestHoodieDataSource#testStreamWriteBat…

2022-05-05 Thread GitBox


danny0405 opened a new pull request, #5510:
URL: https://github.com/apache/hudi/pull/5510

   …chReadOptimized
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5506: [HUDI-4042] Support truncate-partition for Spark-3.2

2022-05-05 Thread GitBox


hudi-bot commented on PR #5506:
URL: https://github.com/apache/hudi/pull/5506#issuecomment-1118579317

   
   ## CI report:
   
   * 6f6ffdafc1e6ade28a7d340024905374353032af Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8444)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5505: [HUDI-3687] Enable spark32 in GH actions

2022-05-05 Thread GitBox


hudi-bot commented on PR #5505:
URL: https://github.com/apache/hudi/pull/5505#issuecomment-1118579254

   
   ## CI report:
   
   * 28c50bbaacaa1b70811f02c3a1e02138dfd09e15 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8447)
 
   * e6575ef2e0082baa6608f8c3f75a4ec6d1312662 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8449)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4043) Clean the marker files for compaction rollback

2022-05-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-4043.

Resolution: Won't Fix

> Clean the marker files for compaction rollback
> --
>
> Key: HUDI-4043
> URL: https://issues.apache.org/jira/browse/HUDI-4043
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.1, 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] danny0405 closed pull request #5508: [HUDI-4043] Clean the marker files for compaction rollback

2022-05-05 Thread GitBox


danny0405 closed pull request #5508: [HUDI-4043] Clean the marker files for 
compaction rollback
URL: https://github.com/apache/hudi/pull/5508


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #5508: [HUDI-4043] Clean the marker files for compaction rollback

2022-05-05 Thread GitBox


danny0405 commented on PR #5508:
URL: https://github.com/apache/hudi/pull/5508#issuecomment-1118567331

   Closing because `BaseRollbackActionExecutor.runRollback` already does that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-05 Thread GitBox


hudi-bot commented on PR #5445:
URL: https://github.com/apache/hudi/pull/5445#issuecomment-1118566015

   
   ## CI report:
   
   * 1e9b3ac4c34f97f5ccf3a639cc74b7081eeaab37 UNKNOWN
   * a5669a78b314a5dc4166bcc4d41d2a377653da75 UNKNOWN
   * 6426727bb88fce863d7aa50ef04b2cdac7acb2e2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8446)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


