[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex

2022-10-07 Thread GitBox


YuweiXiao commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r990601721


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -179,15 +197,125 @@ public void close() throws Exception {
   }
 
   protected List<PartitionPath> getAllQueryPartitionPaths() {
+if (cachedAllPartitionPaths != null) {
+  return cachedAllPartitionPaths;
+}
+
+loadAllQueryPartitionPaths();
+return cachedAllPartitionPaths;
+  }
+
+  private void loadAllQueryPartitionPaths() {
 List<String> queryRelativePartitionPaths = queryPaths.stream()
 .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
 .collect(Collectors.toList());
 
-// Load all the partition path from the basePath, and filter by the query partition path.
-// TODO load files from the queryRelativePartitionPaths directly.
-List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
-.stream()
-.filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
+this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths);
+
+// If the partition value contains InternalRow.empty, we query it as a non-partitioned table.
+this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0);
+  }
+
+  protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+if (!isAllInputFileSlicesCached) {

Review Comment:
   Yeah, good point. 1) generalize it to a batch get, and 2) load only the remaining partitions.
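
   As an editorial illustration of the two points above (hypothetical class and method names, not the PR's actual code), batch-loading only the partitions missing from the cache could look roughly like this:

   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: cache file slices per partition and batch-load only
   // the partitions that are not cached yet.
   class LazyFileSliceCache<P, S> {
     private final Map<P, List<S>> cachedFileSlices = new HashMap<>();

     interface BatchLoader<P, S> {
       // Loads file slices for all requested partitions in one call.
       Map<P, List<S>> load(List<P> partitions);
     }

     Map<P, List<S>> getInputFileSlices(List<P> queryPartitions, BatchLoader<P, S> loader) {
       List<P> missing = new ArrayList<>();
       for (P partition : queryPartitions) {
         if (!cachedFileSlices.containsKey(partition)) {
           missing.add(partition);
         }
       }
       if (!missing.isEmpty()) {
         // One batch call instead of one listing per uncached partition.
         cachedFileSlices.putAll(loader.load(missing));
       }
       Map<P, List<S>> result = new HashMap<>();
       for (P partition : queryPartitions) {
         result.put(partition, cachedFileSlices.get(partition));
       }
       return result;
     }
   }
   ```

   The key idea is that a single batch call covers all uncached partitions, while partitions that were already fetched are served from memory.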






[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex

2022-10-07 Thread GitBox


YuweiXiao commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r990601573


##
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##
@@ -179,15 +197,125 @@ public void close() throws Exception {
   }
 
   protected List<PartitionPath> getAllQueryPartitionPaths() {
+if (cachedAllPartitionPaths != null) {
+  return cachedAllPartitionPaths;
+}
+
+loadAllQueryPartitionPaths();

Review Comment:
   Yes, you are right. I will have it inlined.






[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-07 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272242144

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * f8732300afaf355296ca13fe7f2d3e9a131315d6 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12063)
 
   * 18ef7b44488dff256728b2bba024b4a4d00aebe9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-07 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272241019

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 1d98224805b75fc0c9c8ec54948870e96c4b54e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12043)
 
   * f8732300afaf355296ca13fe7f2d3e9a131315d6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12063)
 
   * 18ef7b44488dff256728b2bba024b4a4d00aebe9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-07 Thread GitBox


hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272240117

   
   ## CI report:
   
   * 288d166c49602a4593b1e97763a467811903737d UNKNOWN
   * 1d98224805b75fc0c9c8ec54948870e96c4b54e7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12043)
 
   * f8732300afaf355296ca13fe7f2d3e9a131315d6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

2022-10-07 Thread GitBox


alexeykudinkin commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r990595189


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -185,7 +185,7 @@ public class HoodieWriteConfig extends HoodieConfig {
 
   public static final ConfigProperty<String> AVRO_SCHEMA_VALIDATE_ENABLE = ConfigProperty
   .key("hoodie.avro.schema.validate")
-  .defaultValue("false")
+  .defaultValue("true")

Review Comment:
   This is flipped to `true` by default to make sure proper schema validation is run for every operation on the table.
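
   For completeness, users who depend on the old behavior can still opt out by setting the key shown in the diff; a minimal, hedged example (plain options map, the exact builder API may differ):

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class WriteOptionsExample {
     public static Map<String, String> writeOptions() {
       Map<String, String> opts = new HashMap<>();
       // Explicitly opt out of the new default if the legacy behavior is needed.
       opts.put("hoodie.avro.schema.validate", "false");
       return opts;
     }
   }
   ```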



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##
@@ -81,20 +79,7 @@
   public static IgnoreRecord IGNORE_RECORD = new IgnoreRecord();
 
   /**
-   * The specified schema of the table. ("specified" denotes that this is 
configured by the client,
-   * as opposed to being implicitly fetched out of the commit metadata)
-   */
-  protected final Schema tableSchema;
-  protected final Schema tableSchemaWithMetaFields;

Review Comment:
   These fields were misused and are redundant, hence deleted



##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaCompatibility.java:
##
@@ -0,0 +1,941 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.Schema.Type;
+import org.apache.hudi.common.util.Either;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.Method;
+import java.util.ArrayDeque;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Deque;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+/**
+ * Evaluate the compatibility between a reader schema and a writer schema. A
+ * reader and a writer schema are declared compatible if all datum instances of
+ * the writer schema can be successfully decoded using the specified reader
+ * schema.
+ *
+ * NOTE: PLEASE READ CAREFULLY BEFORE CHANGING
+ *
+ *   This code is borrowed from Avro 1.10, with the following 
modifications:
+ *   
+ * Compatibility checks ignore schema name, unless schema is held 
inside
+ * a union
+ *   
+ *
+ */
+public class AvroSchemaCompatibility {

Review Comment:
   Context: Avro requires that schema names match in order for two schemas to be considered compatible. Since only Avro carries names on the schemas themselves (Spark, for example, does not), this makes some schemas converted from Spark's [[StructType]] incompatible w/ Avro.
   
   
   This code is mostly borrowed as-is from Avro 1.10, with one critical adjustment: schema names are now only checked in the following 2 cases (see the sketch after this list):
   
- In case it's a top-level schema
- In case the schema is enclosed in a union (in which case its name might be 
used for reverse lookup)
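
   A minimal illustration (not the borrowed Avro code itself) of the name-checking policy described above, assuming the Avro `Schema` API:

   ```java
   import org.apache.avro.Schema;

   // Illustration only: decide whether schema names should participate in a
   // compatibility check, per the two cases described above.
   public class NameCheckPolicy {

     public static boolean shouldCheckNames(Schema reader, boolean isTopLevel, boolean insideUnion) {
       // Names matter only for named types (records, enums, fixed).
       boolean isNamedType = reader.getType() == Schema.Type.RECORD
           || reader.getType() == Schema.Type.ENUM
           || reader.getType() == Schema.Type.FIXED;
       // Case 1: top-level schema; case 2: schema enclosed in a union,
       // where the name may be used for reverse lookup of the matching branch.
       return isNamedType && (isTopLevel || insideUnion);
     }

     public static boolean namesMatch(Schema reader, Schema writer) {
       return reader.getFullName().equals(writer.getFullName());
     }
   }
   ```

   Only named types carry names at all, and even then they are compared only at the top level or inside a union.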
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseMergeHelper.java:
##
@@ -18,91 +18,47 @@
 
 package org.apache.hudi.table.action.commit;
 
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.client.utils.MergingIterator;
-import org.apache.hudi.common.model.HoodieBaseFile;
-import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
-import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.io.HoodieMergeHandle;
 import org.apache.hudi.io.storage.HoodieFileReader;
 import org.apache.hudi.io.storage.HoodieFileReaderFactory;
 import org.apache.hudi.table.HoodieTable;
 
-import org.apache.avro.ge

[GitHub] [hudi] xushiyan commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0

2022-10-07 Thread GitBox


xushiyan commented on issue #6692:
URL: https://github.com/apache/hudi/issues/6692#issuecomment-1272232597

   > @xushiyan I use the fat jar, but I do not know what is added to the 
classpath by AWS in Glue 3.
   
   @eshu by fat jar you mean bundle jar? Is it Hudi Spark 3.1 bundle? And it's 
the only bundle you used? I need to reproduce this by putting the same jar as 
you did. So pls provide info on what jars you added to your glue job. Thanks.





[GitHub] [hudi] eshu commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0

2022-10-07 Thread GitBox


eshu commented on issue #6692:
URL: https://github.com/apache/hudi/issues/6692#issuecomment-1272232164

   @xushiyan I use the fat jar, but I do not know what is added to the 
classpath by AWS in Glue 3.





[GitHub] [hudi] danny0405 commented on pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on PR #6818:
URL: https://github.com/apache/hudi/pull/6818#issuecomment-1272218777

   > This PR will support flushing of CDC data blocks and rollover of CDC log files. These features need an upgrade of the CDC-related write stats, which is the key point that needs to be discussed.
   > 
   > There are two possible solutions:
   > 
   > 1. Like this PR: both `cdcPaths` and `cdcWriteBytes` use the `list` data 
   > type.
   > 2. Use a map, like:
   > 
   > ```
   > cdcWriteStats: {
   >   "cdclogfile1": cdclogFile1Size,
   >   "cdclogfile2": cdclogFile2Size
   > }
   > ```
   > 
   > cc @xushiyan @alexeykudinkin @danny0405 WDYT?
   
   What is the file size used for?





[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990585049


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieCDCLogRecordIterator.java:
##
@@ -27,50 +27,94 @@
 import org.apache.avro.generic.IndexedRecord;
 
 import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
 
 import java.io.IOException;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicInteger;
 
 public class HoodieCDCLogRecordIterator implements ClosableIterator<IndexedRecord> {
 
-  private final HoodieLogFile cdcLogFile;
+  private final FileSystem fs;
 
-  private final HoodieLogFormat.Reader reader;
+  private final Schema cdcSchema;
+
+  private final Iterator<HoodieLogFile> cdcLogFileIter;
+
+  private HoodieLogFormat.Reader reader;
+
+  /**
+   * Because hasNext of {@link HoodieLogFormat.Reader} is not idempotent,
+   * idempotency is guaranteed here via `hasNextCall` and `nextCall`.
+   */
+  private final AtomicInteger hasNextCall = new AtomicInteger(0);
+  private final AtomicInteger nextCall = new AtomicInteger(0);
 
   private ClosableIterator<IndexedRecord> itr;
 
-  public HoodieCDCLogRecordIterator(
-  FileSystem fs,
-  Path cdcLogPath,
-  Schema cdcSchema) throws IOException {
-this.cdcLogFile = new HoodieLogFile(fs.getFileStatus(cdcLogPath));
-this.reader = new HoodieLogFileReader(fs, cdcLogFile, cdcSchema,
-HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false);
+  public HoodieCDCLogRecordIterator(FileSystem fs, HoodieLogFile[] 
cdcLogFiles, Schema cdcSchema) {
+this.fs = fs;
+this.cdcSchema = cdcSchema;
+this.cdcLogFileIter = Arrays.stream(cdcLogFiles).iterator();
   }

Review Comment:
   Do we have some sort of ordering for these files?
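
   If a deterministic order is needed, a hedged sketch of sorting the files before building the iterator, assuming `HoodieLogFile` exposes a log version and file size:

   ```java
   import java.util.Arrays;
   import java.util.Comparator;
   import java.util.Iterator;

   import org.apache.hudi.common.model.HoodieLogFile;

   // Sketch: establish a deterministic read order across CDC log files.
   public class CdcLogFileOrdering {
     public static Iterator<HoodieLogFile> orderedIterator(HoodieLogFile[] cdcLogFiles) {
       return Arrays.stream(cdcLogFiles)
           // Sort by log version (file size as a tie-breaker) so records are
           // replayed in the order the files were rolled over.
           .sorted(Comparator.comparingInt(HoodieLogFile::getLogVersion)
               .thenComparingLong(HoodieLogFile::getFileSize))
           .iterator();
     }
   }
   ```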






[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584901


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieCDCLogRecordIterator.java:
##
@@ -27,50 +27,94 @@
 import org.apache.avro.generic.IndexedRecord;
 
 import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
 
 import java.io.IOException;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicInteger;
 
 public class HoodieCDCLogRecordIterator implements ClosableIterator<IndexedRecord> {
 
-  private final HoodieLogFile cdcLogFile;
+  private final FileSystem fs;
 
-  private final HoodieLogFormat.Reader reader;
+  private final Schema cdcSchema;
+
+  private final Iterator<HoodieLogFile> cdcLogFileIter;
+
+  private HoodieLogFormat.Reader reader;
+
+  /**
+   * Because hasNext of {@link HoodieLogFormat.Reader} is not idempotent,
+   * idempotency is guaranteed here via `hasNextCall` and `nextCall`.
+   */
+  private final AtomicInteger hasNextCall = new AtomicInteger(0);
+  private final AtomicInteger nextCall = new AtomicInteger(0);

Review Comment:
   We can avoid these two variables by keeping a `currentRecord` reference to the next element of the current iterator.
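
   A simplified sketch of that suggestion (generic iterator, hypothetical class name): buffer the pre-fetched element in a `currentRecord` field so `hasNext()` becomes idempotent without call counters:

   ```java
   import java.util.Iterator;
   import java.util.NoSuchElementException;

   // Sketch: wrap a non-idempotent iterator so that hasNext() can be called
   // repeatedly without consuming elements, by caching the next record.
   public class IdempotentIterator<T> implements Iterator<T> {
     private final Iterator<T> delegate;
     private T currentRecord;  // pre-fetched record, null when nothing is buffered

     public IdempotentIterator(Iterator<T> delegate) {
       this.delegate = delegate;
     }

     @Override
     public boolean hasNext() {
       if (currentRecord != null) {
         return true;
       }
       if (delegate.hasNext()) {
         currentRecord = delegate.next();
         return true;
       }
       return false;
     }

     @Override
     public T next() {
       if (!hasNext()) {
         throw new NoSuchElementException();
       }
       T result = currentRecord;
       currentRecord = null;
       return result;
     }
   }
   ```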






[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584548


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java:
##
@@ -89,9 +94,19 @@ protected void writeInsertRecord(HoodieRecord 
hoodieRecord, Option close() {
 List<WriteStatus> writeStatuses = super.close();
 // if there are cdc data written, set the CDC-related information.
-Option cdcResult =
-HoodieCDCLogger.writeCDCDataIfNeeded(cdcLogger, recordsWritten, 
insertRecordsWritten);
-HoodieCDCLogger.setCDCStatIfNeeded(writeStatuses.get(0).getStat(), 
cdcResult, partitionPath, fs);
+
+if (cdcLogger == null || recordsWritten == 0L || (recordsWritten == 
insertRecordsWritten)) {
+  // the following cases where we do not need to write out the cdc file:

Review Comment:
   The if condition is not suitable for Flink; we may need some changes for the Flink CDC handles.
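
   One hedged way to accommodate that (hypothetical class and method names): extract the skip condition into a protected hook so a Flink-specific handle can override it. The Flink override below is placeholder logic only:

   ```java
   // Sketch only: the condition from the diff above, extracted into a hook.
   abstract class CdcAwareMergeHandle {

     protected boolean shouldSkipCdcWrite(boolean cdcLoggerPresent, long recordsWritten, long insertRecordsWritten) {
       // Default rule, taken from the condition shown in the diff above.
       return !cdcLoggerPresent || recordsWritten == 0L || recordsWritten == insertRecordsWritten;
     }
   }

   class FlinkCdcAwareMergeHandle extends CdcAwareMergeHandle {
     @Override
     protected boolean shouldSkipCdcWrite(boolean cdcLoggerPresent, long recordsWritten, long insertRecordsWritten) {
       // Placeholder: a Flink handle may flush several times per checkpoint,
       // so it could apply a different (engine-specific) rule here.
       return !cdcLoggerPresent;
     }
   }
   ```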






[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584508


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieWriteStat.java:
##
@@ -254,12 +256,12 @@ public String getPath() {
   }
 
   @Nullable
-  public String getCdcPath() {
-return cdcPath;
+  public List<String> getCdcPaths() {
+return cdcPaths;
   }
 
-  public void setCdcPath(String cdcPath) {
-this.cdcPath = cdcPath;
+  public void setCdcPath(List<String> cdcPaths) {
+this.cdcPaths = cdcPaths;

Review Comment:
   setCdcPath -> setCdcPaths






[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990583580


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -73,35 +80,56 @@ public class HoodieCDCLogger implements Closeable {
 
   private final Schema cdcSchema;
 
-  private final String cdcSchemaString;
-
   // the cdc data
   private final Map cdcData;
 
+  private final Map 
cdcDataBlockHeader;
+
   // the cdc record transformer
   private final CDCTransformer transformer;
 
+  // Max block size to limit to for a log block
+  private final int maxBlockSize;
+
+  // Average cdc record size. This size is updated at the end of every log 
block flushed to disk
+  private long averageCDCRecordSize = 0;
+
+  // Number of records that must be written to meet the max block size for a 
log block
+  private AtomicInteger numOfCDCRecordInMemory = new AtomicInteger();
+

Review Comment:
   `numOfCDCRecordInMemory` -> `numOfCDCRecordsInMemory`
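
   For context, a hedged sketch (illustrative only, not the PR's exact logic) of how the max block size, average record size, and in-memory record counter above typically combine into a flush decision:

   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   // Sketch: flush the in-memory CDC records once their estimated size reaches
   // the configured max block size.
   class CdcBlockSizeEstimator {
     private final int maxBlockSize;
     private long averageCdcRecordSize;
     private final AtomicInteger numOfCdcRecordsInMemory = new AtomicInteger();

     CdcBlockSizeEstimator(int maxBlockSize, long initialAverageRecordSize) {
       this.maxBlockSize = maxBlockSize;
       this.averageCdcRecordSize = initialAverageRecordSize;
     }

     void onRecordBuffered() {
       numOfCdcRecordsInMemory.incrementAndGet();
     }

     boolean shouldFlush() {
       // Estimated in-memory size = buffered records * average record size.
       return averageCdcRecordSize * numOfCdcRecordsInMemory.get() >= maxBlockSize;
     }

     void onBlockFlushed(long flushedBytes) {
       int flushedRecords = numOfCdcRecordsInMemory.getAndSet(0);
       if (flushedRecords > 0) {
         // Refresh the running average after every block written to disk.
         averageCdcRecordSize = flushedBytes / flushedRecords;
       }
     }
   }
   ```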






[GitHub] [hudi] danny0405 commented on a diff in pull request #6845: [HUDI-4945] Support to trigger the clean in the flink batch mode.

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6845:
URL: https://github.com/apache/hudi/pull/6845#discussion_r990583385


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java:
##
@@ -96,9 +96,10 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context 
context) {
   pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
   // compaction
   if (OptionsResolver.needsAsyncCompaction(conf)) {
-// use synchronous compaction for bounded source.
+// use synchronous compaction and clean for bounded source.
 if (context.isBounded()) {
   conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
+  conf.setBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED, false);
 }

Review Comment:
   Yeah, thanks for the explanation, can we try to add a test case here ?






[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

2022-10-07 Thread GitBox


danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r990583099


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##
@@ -279,8 +279,9 @@ private FlinkOptions() {
   .defaultValue(false)// default read as batch
   .withDescription("Whether to skip compaction instants for streaming 
read,\n"
   + "there are two cases that this option can be used to avoid reading 
duplicates:\n"
-  + "1) you are definitely sure that the consumer reads faster than 
any compaction instants, "
-  + "usually with delta time compaction strategy that is long enough, 
for e.g, one week;\n"
+  + "1) `hoodie.compaction.preserve.commit.metadata` is set to `false` 
and you are definitely sure that the "
+  + "consumer reads faster than any compaction instants, usually with 
delta time compaction strategy that is "

Review Comment:
   Thanks for the enhancement. We need the compaction to preserve the commit-time metadata field, and it is `true` by default.






[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling

2022-10-07 Thread GitBox


hudi-bot commented on PR #6889:
URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272214966

   
   ## CI report:
   
   * 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12062)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Resolved] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron resolved HUDI-4949.
--

> Optimize cdc read to avoid problems that caused by reusing buffer underlying 
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-4949:
-
Fix Version/s: 0.13.0

> Optimize cdc read to avoid problems that caused by reusing buffer underlying 
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>






[jira] [Assigned] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-4949:


Assignee: Yann Byron

> Optimize cdc read to avoid problems that caused by reusing buffer underlying 
> the Row
> 
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Closed] (HUDI-4915) Spark Avro SerDe returns wrong result upon multiple calls

2022-10-07 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron closed HUDI-4915.

Resolution: Won't Fix

> Spark Avro SerDe returns wrong result upon multiple calls
> -
>
> Key: HUDI-4915
> URL: https://issues.apache.org/jira/browse/HUDI-4915
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, the Spark Avro serializer/deserializer has a bug where it returns 
> the same object when the method is called twice in a row.  For example:
> val row1: InternalRow = ...
> val row2: InternalRow = ... // row2 is different from row1
>  
> val serializedRecord1 = serialize(row1)
> val serializedRecord2 = serialize(row2)
> serializedRecord1.equals(serializedRecord2) // unexpectedly true
>  
> That is because we use `val` to declare the serializer/deserializer 
> functions, so the later result overwrites the previous one.
>  
>  
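
A simplified Java analogue of the pitfall described above (not the actual Spark Avro code): a serializer that reuses one mutable buffer makes two consecutive results compare as identical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified analogue of the reuse bug: the serializer hands back the same
// mutable buffer on every call, so the second call overwrites the first result.
class ReusingSerializer {
  private final byte[] buffer = new byte[16];

  byte[] serialize(String value) {
    Arrays.fill(buffer, (byte) 0);
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    System.arraycopy(bytes, 0, buffer, 0, Math.min(bytes.length, buffer.length));
    return buffer;  // same object every time -- callers must copy it
  }

  public static void main(String[] args) {
    ReusingSerializer serde = new ReusingSerializer();
    byte[] first = serde.serialize("row1");
    byte[] second = serde.serialize("row2");
    // Prints true: 'first' was silently overwritten by the second call.
    System.out.println(first == second && Arrays.equals(first, second));
  }
}
```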





[jira] [Assigned] (HUDI-4857) Replace DataFrame with HoodieData in Spark side

2022-10-07 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An reassigned HUDI-4857:


Assignee: (was: Hui An)

> Replace DataFrame with HoodieData in Spark side
> ---
>
> Key: HUDI-4857
> URL: https://issues.apache.org/jira/browse/HUDI-4857
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: code-quality, spark
>Reporter: Hui An
>Priority: Major
>






[GitHub] [hudi] YannByron opened a new pull request, #6891: [MINOR] update committer list

2022-10-07 Thread GitBox


YannByron opened a new pull request, #6891:
URL: https://github.com/apache/hudi/pull/6891

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] felixYyu commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`

2022-10-07 Thread GitBox


felixYyu commented on code in PR #5064:
URL: https://github.com/apache/hudi/pull/5064#discussion_r990579376


##
hudi-common/src/main/java/org/apache/hudi/common/table/catalog/FileBasedMetaClient.java:
##
@@ -0,0 +1,196 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.catalog;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PathFilter;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.ConsistencyGuardConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.FailSafeConsistencyGuard;
+import org.apache.hudi.common.fs.FileSystemRetryConfig;
+import org.apache.hudi.common.fs.HoodieRetryWrapperFileSystem;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
+import org.apache.hudi.common.fs.NoOpConsistencyGuard;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.TimelineLayout;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.exception.TableNotFoundException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+public class FileBasedMetaClient implements HoodieMetaClient, Serializable {
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = 
LogManager.getLogger(FileBasedMetaClient.class);
+  public static final String METAFOLDER_NAME = ".hoodie";
+  public static final String AUXILIARYFOLDER_NAME = METAFOLDER_NAME + 
Path.SEPARATOR + ".aux";
+  public static final String SCHEMA_FOLDER_NAME = ".schema";
+
+  private SerializableConfiguration hadoopConf;
+  private ConsistencyGuardConfig consistencyGuardConfig;
+  private FileSystemRetryConfig fileSystemRetryConfig;
+
+  public FileBasedMetaClient(SerializableConfiguration hadoopConf) {
+this.hadoopConf = hadoopConf;
+this.consistencyGuardConfig = ConsistencyGuardConfig.newBuilder().build();
+this.fileSystemRetryConfig = FileSystemRetryConfig.newBuilder().build();
+  }
+
+  public FileBasedMetaClient(SerializableConfiguration hadoopConf, 
ConsistencyGuardConfig consistencyGuardConfig, FileSystemRetryConfig 
fileSystemRetryConfig) {
+this.hadoopConf = hadoopConf;
+this.consistencyGuardConfig = consistencyGuardConfig;
+this.fileSystemRetryConfig = fileSystemRetryConfig;
+  }
+
+  public HoodieWrapperFileSystem getFs(String basePath) {
+FileSystem fileSystem = FSUtils.getFs(new Path(basePath, METAFOLDER_NAME), 
hadoopConf.newCopy());
+
+if (fileSystemRetryConfig.isFileSystemActionRetryEnable()) {
+  fileSystem = new HoodieRetryWrapperFileSystem(fileSystem,
+  fileSystemRetryConfig.getMaxRetryIntervalMs(),
+  fileSystemRetryConfig.getMaxRetryNumbers(),
+  fileSystemRetryConfig.getInitialRetryIntervalMs(),
+  fileSystemRetryConfig.getRetryExceptions());
+}
+ValidationUtils.checkArgument(!(fileSystem instanceof 
HoodieWrapperFileSystem),
+"File System not expected to be that of HoodieWrapperFileSystem");
+return new HoodieWrapperFileSystem(fileSystem,
+consistencyGuardConfig.isConsistencyCheckEnabled()
+? new FailSafeConsistencyGuard(fileSystem, consistencyGuardConfig)
+: new NoOpConsistencyGuard());
+  }
+
+  public HoodieTableConfig getHoodieTableConfig(String basePath, String 
payloadClass) {
+HoodieWrapperFileSystem fs = getFs(basePath);
+Path metaPath = new Path(basePath, METAFOLDER_NAME);
+TableNotFoundException.checkTableValidity(fs, new Path(basePath), 
metaPath);
+return new HoodieTableConfig(fs, metaPath.toString(), payloadClass);
+  }
+
+  public static HoodieTableConfig getHoodieTableConfig(String basePath,

[GitHub] [hudi] danny0405 commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


danny0405 commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1272208595

   > @guanziyue Thanks for your positive feedback. IIUC, this improvement is effective for both Flink/Spark Streaming jobs when building the `FileSystemView`, and the time saved is considerable, as @ThinkerLei mentioned above. Of course, it involves some additional memory cost. I totally agree with the gatekeeper's concern, especially for the Flink engine, where the restart cost of an OOM is not acceptable. Actually, in our prod cluster we did not observe extra OOMs due to this change. Anyway, I think this is one option for performance improvement. FYI.
   
   Thanks for the feedback. Can we have some numbers on the additional memory overhead here?





[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272202007

   
   ## CI report:
   
   * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN
   * 3342da1ce44cd5405714218b671cbf4863d2c6ff Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12061)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6745: Fix comment in RFC46

2022-10-07 Thread GitBox


alexeykudinkin commented on code in PR #6745:
URL: https://github.com/apache/hudi/pull/6745#discussion_r987470813


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##
@@ -461,6 +461,18 @@ abstract class HoodieBaseRelation(val sqlContext: 
SQLContext,
   }
 
   protected def getTableState: HoodieTableState = {
+val mergerImpls = (if 
(optParams.contains(HoodieWriteConfig.MERGER_IMPLS.key())) {

Review Comment:
   @wzx140 let's abstract this behind a common utility in the config.
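
   A rough sketch of such a shared helper (hypothetical class and method names); callers would pass `HoodieWriteConfig.MERGER_IMPLS.key()` and its default value:

   ```java
   import java.util.Map;

   // Hypothetical helper: resolve the record-merger implementations from
   // user-supplied options in one shared place, falling back to a default.
   public final class MergerConfigUtils {
     private MergerConfigUtils() {
     }

     public static String resolveMergerImpls(Map<String, String> options, String mergerImplsKey, String defaultImpls) {
       String configured = options.get(mergerImplsKey);
       return (configured == null || configured.trim().isEmpty()) ? defaultImpls : configured;
     }
   }
   ```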



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -870,18 +869,17 @@ object HoodieSparkSqlWriter {
   hoodieRecord
 }).toJavaRDD()
   case HoodieRecord.HoodieRecordType.SPARK =>
+log.debug(s"Use ${HoodieRecord.HoodieRecordType.SPARK}")

Review Comment:
   Let's lift this log before the match so that we can tell if it's Avro or 
Spark



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -151,11 +155,10 @@ protected  void processNextRecord(HoodieRecord 
hoodieRecord) throws IOExce
 
   HoodieRecord oldRecord = records.get(key);
   T oldValue = oldRecord.getData();
-  T combinedValue = ((HoodieRecord) recordMerger.merge(oldRecord, 
hoodieRecord, readerSchema, this.getPayloadProps()).get()).getData();

Review Comment:
   Was the `getPayloadProps` change intentional? Just calling it out to make sure we can validate it.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -253,21 +253,21 @@ private Option 
prepareRecord(HoodieRecord hoodieRecord) {
   }
 
   private HoodieRecord populateMetadataFields(HoodieRecord hoodieRecord, 
Schema schema, Properties prop) throws IOException {
-Map metadataValues = new HashMap<>();
-String seqId =
-HoodieRecord.generateSequenceId(instantTime, getPartitionId(), 
RECORD_COUNTER.getAndIncrement());
+MetadataValues metadataValues = new MetadataValues();
 if (config.populateMetaFields()) {
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.FILENAME_METADATA_FIELD.getFieldName(),
 fileId);
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.PARTITION_PATH_METADATA_FIELD.getFieldName(),
 partitionPath);
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName(),
 hoodieRecord.getRecordKey());
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.COMMIT_TIME_METADATA_FIELD.getFieldName(),
 instantTime);
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.COMMIT_SEQNO_METADATA_FIELD.getFieldName(),
 seqId);
+  String seqId =
+  HoodieRecord.generateSequenceId(instantTime, getPartitionId(), 
RECORD_COUNTER.getAndIncrement());
+  metadataValues.setFileName(fileId);
+  metadataValues.setPartitionPath(partitionPath);
+  metadataValues.setRecordKey(hoodieRecord.getRecordKey());
+  metadataValues.setCommitTime(instantTime);
+  metadataValues.setCommitSeqno(seqId);
 }
 if (config.allowOperationMetadataField()) {
-  
metadataValues.put(HoodieRecord.HoodieMetadataField.OPERATION_METADATA_FIELD.getFieldName(),
 hoodieRecord.getOperation().getName());
+  metadataValues.setOperation(hoodieRecord.getOperation().getName());
 }
 
-return hoodieRecord.updateValues(schema, prop, metadataValues);
+return hoodieRecord.updateMetadataValues(schema, prop, metadataValues);

Review Comment:
   Why do we need to update meta values if we're not populating them?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -295,12 +295,11 @@ object HoodieSparkSqlWriter {
   tblName, mapAsJavaMap(addSchemaEvolutionParameters(parameters, 
internalSchemaOpt) - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key)
 )).asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]]
 val writeConfig = client.getConfig
-if (writeConfig.getRecordMerger.getRecordType == 
HoodieRecordType.SPARK && tableType == HoodieTableType.MERGE_ON_READ &&

Review Comment:
   I think `HoodieTableType.MERGE_ON_READ` would be preferred



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/LogFileIterator.scala:
##
@@ -240,17 +236,22 @@ class RecordMergingFileIterator(split: 
HoodieMergeOnReadFileSplit,
   private def merge(curRow: InternalRow, newRecord: HoodieRecord[_]): 
Option[InternalRow] = {
 // NOTE: We have to pass in Avro Schema used to read from Delta Log file 
since we invoke combining API
 //   on the record from the Delta Log
+val curRecord = recordMerger.getRecordType match {
+  case HoodieRecordType.SPARK =>
+new HoodieSparkRecord(curRow, baseFileReader.schema)
+  case _ =>
+new HoodieAvroIndexedRecord(serial

[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables

2022-10-07 Thread GitBox


hudi-bot commented on PR #5958:
URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272185755

   
   ## CI report:
   
   * 0869c63d96180152d4b9a51f70d2c9d83bb95edd Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12060)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan opened a new pull request, #6890: [WIP] Batch clean delete files retry

2022-10-07 Thread GitBox


nsivabalan opened a new pull request, #6890:
URL: https://github.com/apache/hudi/pull/6890

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling

2022-10-07 Thread GitBox


hudi-bot commented on PR #6889:
URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272169129

   
   ## CI report:
   
   * 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12062)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling

2022-10-07 Thread GitBox


hudi-bot commented on PR #6889:
URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272167652

   
   ## CI report:
   
   * 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2022-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4921:
-
Labels: pull-request-available  (was: )

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just take the last 
> completed commit in the active timeline and set the value. 
> This needs fixing. 





[GitHub] [hudi] nsivabalan opened a new pull request, #6889: [HUDI-4921] Fixing last completed commit with clean scheduling

2022-10-07 Thread GitBox


nsivabalan opened a new pull request, #6889:
URL: https://github.com/apache/hudi/pull/6889

   ### Change Logs
   
   While planning a clean, we set the last completed commit from the timeline. Currently we just fetch the last completed commit, but it has to refer to the last completed commit with no inflight commits before it. Fixing the same in this patch. 
   This will impact only multi-writer scenarios. 
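
   A hedged sketch of the intended selection rule (illustrative timeline model, not the actual CleanPlanner code): take the latest completed instant that has no inflight instant before it:

   ```java
   import java.util.List;
   import java.util.Optional;

   // Illustrative model: an instant has a timestamp and a completed/inflight state.
   class InstantView {
     final String timestamp;
     final boolean completed;

     InstantView(String timestamp, boolean completed) {
       this.timestamp = timestamp;
       this.completed = completed;
     }
   }

   class CleanPlannerSketch {
     // Returns the last completed instant such that every earlier instant is
     // also completed (i.e. no inflight commits exist before it).
     static Optional<String> lastCompletedBeforeAnyInflight(List<InstantView> timeline) {
       String candidate = null;
       for (InstantView instant : timeline) {  // timeline assumed sorted by timestamp
         if (!instant.completed) {
           break;  // stop at the first inflight: nothing after it is safe to use
         }
         candidate = instant.timestamp;
       }
       return Optional.ofNullable(candidate);
     }
   }
   ```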
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: low **
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2022-10-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4921:
--
Status: In Progress  (was: Open)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just take the last 
> completed commit in the active timeline and set the value. 
> This needs fixing. 





[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner

2022-10-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4921:
--
Status: Patch Available  (was: In Progress)

> Fix last completed commit in CleanPlanner
> -
>
> Key: HUDI-4921
> URL: https://issues.apache.org/jira/browse/HUDI-4921
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Recently we added the last completed commit as part of the clean commit metadata. 
> Ideally the value should represent the last completed commit in the timeline 
> before which there are no inflight commits, but we just take the last 
> completed commit in the active timeline and set the value. 
> This needs fixing. 





[jira] [Updated] (HUDI-3954) Don't keep the last commit before the earliest commit to retain

2022-10-07 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3954:
--
Sprint:   (was: 2022/10/04)

> Don't keep the last commit before the earliest commit to retain
> ---
>
> Key: HUDI-3954
> URL: https://issues.apache.org/jira/browse/HUDI-3954
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: 董可伦
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>
> Don't keep the last commit before the earliest commit to retain
> According to the document of {{{}hoodie.cleaner.commits.retained{}}}:
> Number of commits to retain, without cleaning. This will be retained for 
> num_of_commits * time_between_commits (scheduled). This also directly 
> translates into how much data retention the table supports for incremental 
> queries.
>  
> We only need to keep the number of commits configured through the parameter 
> {{{}hoodie.cleaner.commits.retained{}}}.
> And the commits retained by clean are completed. This ensures that "This will be 
> retained for num_of_commits * time_between_commits" in the document.
> So we don't need to keep the last commit before the earliest commit to 
> retain. If we want to keep more versions, we can increase the parameter 
> {{hoodie.cleaner.commits.retained}}.





[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272144595

   
   ## CI report:
   
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
 
   * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN
   * 3342da1ce44cd5405714218b671cbf4863d2c6ff Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12061)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272142697

   
   ## CI report:
   
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
 
   * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN
   * 3342da1ce44cd5405714218b671cbf4863d2c6ff UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables

2022-10-07 Thread GitBox


hudi-bot commented on PR #5958:
URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272142215

   
   ## CI report:
   
   * 00eefd74074b2e0e04dc308ab9b775e09ed7803b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12035)
 
   * 0869c63d96180152d4b9a51f70d2c9d83bb95edd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12060)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272140645

   
   ## CI report:
   
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
 
   * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables

2022-10-07 Thread GitBox


hudi-bot commented on PR #5958:
URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272140065

   
   ## CI report:
   
   * 00eefd74074b2e0e04dc308ab9b775e09ed7803b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12035)
 
   * 0869c63d96180152d4b9a51f70d2c9d83bb95edd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-07 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272138058

   
   ## CI report:
   
   * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12057)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] 01/01: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 5d2c2853ea37ca8268f0d049460bee026216
Merge: 06d924137b 7d5b9dc0a9
Author: Alexey Kudinkin 
AuthorDate: Fri Oct 7 15:32:38 2022 -0700

[HUDI-4905] Improve type handling in proto schema conversion

 hudi-utilities/pom.xml |   1 -
 .../schema/ProtoClassBasedSchemaProvider.java  |  19 +-
 .../sources/helpers/ProtoConversionUtil.java   | 242 ++---
 .../schema/TestProtoClassBasedSchemaProvider.java  |  21 +-
 .../utilities/sources/TestProtoKafkaSource.java|   2 +-
 .../sources/helpers/TestProtoConversionUtil.java   | 100 +++--
 .../schema-provider/proto/oneof_schema.avsc|  42 
 .../resources/schema-provider/proto/sample.proto   |   8 +
 ..._flattened.avsc => sample_schema_defaults.avsc} |  31 ++-
 ...le_schema_wrapped_and_timestamp_as_record.avsc} |  16 +-
 10 files changed, 357 insertions(+), 125 deletions(-)



[hudi] branch master updated (06d924137b -> 5d2c2853ea)

2022-10-07 Thread akudinkin
This is an automated email from the ASF dual-hosted git repository.

akudinkin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 06d924137b [HUDI-2786] Docker demo on mac aarch64 (#6859)
 add 9c1fa14fd6 add support for unraveling proto schemas
 add 510d525e15 fix some compile issues
 add aad9ec1320 naming and style updates
 add 889927 make test data random, reuse code
 add a922a5beca add test for 2 different recursion depths, fix schema cache 
key
 add 3b37dc95d9 add unsigned long support
 add 706291d4f3 better handle other types
 add c28e874fca rebase on 4904
 add 190cc16381 get all tests working
 add f18fff886e fix oneof expected schema, update tests after rebase
 add ff5baa8706 revert scala binary change
 add 0069da2d1a try a different method to avoid avro version
 add 71a39bf488 Merge remote-tracking branch 'origin/master' into HUDI-4905
 add c5dff63375 delete unused file
 add f53d47ea3b address PR feedback, update decimal precision
 add 1831639e39 fix isNullable issue, check if class is Int64value
 add eca2992d65 checkstyle fix
 add 423da6f7bb change wrapper descriptor set initialization
 add fb2d9f0030 add in testing for unsigned long to BigInteger conversion
 add f03f9610cf shade protobuf dependency
 add 57f8b81194 Merge remote-tracking branch 'origin/master' into HUDI-4905
 add 7d5b9dc0a9 Revert "shade protobuf dependency"
 new 5d2c2853ea [HUDI-4905] Improve type handling in proto schema conversion

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 hudi-utilities/pom.xml |   1 -
 .../schema/ProtoClassBasedSchemaProvider.java  |  19 +-
 .../sources/helpers/ProtoConversionUtil.java   | 242 ++---
 .../schema/TestProtoClassBasedSchemaProvider.java  |  21 +-
 .../utilities/sources/TestProtoKafkaSource.java|   2 +-
 .../sources/helpers/TestProtoConversionUtil.java   | 100 +++--
 .../schema-provider/proto/oneof_schema.avsc|  42 
 .../resources/schema-provider/proto/sample.proto   |   8 +
 ..._flattened.avsc => sample_schema_defaults.avsc} |  31 ++-
 ...le_schema_wrapped_and_timestamp_as_record.avsc} |  16 +-
 10 files changed, 357 insertions(+), 125 deletions(-)
 create mode 100644 
hudi-utilities/src/test/resources/schema-provider/proto/oneof_schema.avsc
 rename 
hudi-utilities/src/test/resources/schema-provider/proto/{sample_schema_flattened.avsc
 => sample_schema_defaults.avsc} (92%)
 rename 
hudi-utilities/src/test/resources/schema-provider/proto/{sample_schema_nested.avsc
 => sample_schema_wrapped_and_timestamp_as_record.avsc} (95%)



[GitHub] [hudi] alexeykudinkin merged pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread GitBox


alexeykudinkin merged PR #6806:
URL: https://github.com/apache/hudi/pull/6806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272095799

   
   ## CI report:
   
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-07 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272095614

   
   ## CI report:
   
   * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017)
 
   * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12057)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6887: [MINOR][DOCS] Fix docker_demo.md for 0.12.0

2022-10-07 Thread GitBox


nsivabalan commented on PR #6887:
URL: https://github.com/apache/hudi/pull/6887#issuecomment-1272085515

   this is already addressed in https://github.com/apache/hudi/pull/6860
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4976) Update docker demo website page to explain how to run on m1 macs

2022-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4976:
-
Labels: pull-request-available  (was: )

> Update docker demo website page to explain how to run on m1 macs
> 
>
> Key: HUDI-4976
> URL: https://issues.apache.org/jira/browse/HUDI-4976
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Update the website to reflect the changes done in the HUDI-2786 fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan merged pull request #6860: [HUDI-4976] added m1 changes to the site

2022-10-07 Thread GitBox


nsivabalan merged PR #6860:
URL: https://github.com/apache/hudi/pull/6860


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [HUDI-4976] added m1 changes to the site (#6860)

2022-10-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new b662955f68 [HUDI-4976] added m1 changes to the site (#6860)
b662955f68 is described below

commit b662955f6885608961b0e8a83e1b841506087b88
Author: Jon Vexler 
AuthorDate: Fri Oct 7 17:03:53 2022 -0400

[HUDI-4976] added m1 changes to the site (#6860)
---
 website/docs/docker_demo.md| 63 +-
 .../versioned_docs/version-0.12.0/docker_demo.md   | 76 +++---
 2 files changed, 128 insertions(+), 11 deletions(-)

diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index 698aec5439..681b1be51a 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -4,6 +4,8 @@ keywords: [ hudi, docker, demo]
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
 ## A Demo using Docker containers
 
@@ -58,6 +60,15 @@ The next step is to run the Docker compose script and setup 
configs for bringing
 
 This should pull the Docker images from Docker hub and setup the Docker 
cluster.
 
+
+
+
 ```java
 cd docker
 ./setup_demo.sh
@@ -118,6 +129,50 @@ Copying spark default config and setting up configs
 $ docker ps
 ```
 
+
+
+Please note that Presto and Trino do not currently work for the docker demo on 
Mac AArch64
+
+```java
+cd docker
+./setup_demo.sh --mac-aarch64
+...
+..
+[+] Running 12/12
+⠿ adhoc-1 Pulled  2.9s
+⠿ spark-worker-1 Pulled   3.0s
+⠿ kafka Pulled2.9s
+⠿ datanode1 Pulled2.9s
+⠿ hivemetastore Pulled2.9s
+⠿ hiveserver Pulled   3.0s
+⠿ hive-metastore-postgresql Pulled2.8s
+⠿ namenode Pulled 2.9s
+⠿ sparkmaster Pulled  2.9s
+⠿ zookeeper Pulled2.8s
+⠿ adhoc-2 Pulled  2.9s
+⠿ historyserver Pulled2.9s
+[+] Running 12/12
+⠿ Container zookeeper  Started   41.0s
+⠿ Container kafkabrokerStarted   41.7s
+⠿ Container hive-metastore-postgresql  Running0.0s
+⠿ Container namenode   Running0.0s
+⠿ Container hivemetastore  Running0.0s
+⠿ Container historyserver  Started   41.0s
+⠿ Container datanode1  Started   49.9s
+⠿ Container hiveserver Running0.0s
+⠿ Container sparkmasterStarted   41.9s
+⠿ Container spark-worker-1 Started   50.2s
+⠿ Container adhoc-2Started   38.5s
+⠿ Container adhoc-1Started   38.5s
+Copying spark default config and setting up configs
+Copying spark default config and setting up configs
+$ docker ps
+```
+
+
+ 
+
 At this point, the Docker cluster will be up and running. The demo cluster 
brings up the following services
 
* HDFS Services (NameNode, DataNode)
@@ -140,7 +195,9 @@ The batches are windowed intentionally so that the second 
batch contains updates
 
 ### Step 1 : Publish the first batch to Kafka
 
-Upload the first batch to Kafka topic 'stock ticks' `cat 
docker/demo/data/batch_1.json | kcat -b kafkabroker -t stock_ticks -P`
+Upload the first batch to Kafka topic 'stock ticks' 
+
+`cat docker/demo/data/batch_1.json | kcat -b kafkabroker -t stock_ticks -P`
 
 To check if the new topic shows up, use
 ```java
@@ -1137,7 +1194,7 @@ Compaction successfully completed for 20180924070031
 
 # Now refresh and check again. You will see that there is a new compaction 
requested
 
-hoodie:stock_ticks->refresh
+hoodie:stock_ticks_mor->refresh
 18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading 
HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor
 18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from 
/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
 18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
@@ -1163,7 +1220,7 @@ hoodie:stock_ticks_mor->refresh
 18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of 
type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor
 Metadata for table stock_ticks_mor loaded
 
-hoodie:stock_ticks->compactions show all
+hoodie:stock_ticks_mor->compactions show all
 18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants 

[jira] [Updated] (HUDI-2786) Failed to connect to namenode in Docker Demo on Apple M1 chip

2022-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2786:
-
Labels: pull-request-available  (was: )

> Failed to connect to namenode in Docker Demo on Apple M1 chip
> -
>
> Key: HUDI-2786
> URL: https://issues.apache.org/jira/browse/HUDI-2786
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, dev-experience
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> {code:java}
> > ./setup_demo.sh 
> [+] Running 1/0
>  ⠿ compose  Warning: No resource found to remove   0.0s
> [+] Running 15/15
>  ⠿ namenode Pulled   1.4s
>  ⠿ kafka Pulled   1.3s
>  ⠿ presto-worker-1 Pulled   1.3s
>  ⠿ historyserver Pulled   1.4s
>  ⠿ adhoc-2 Pulled   1.3s
>  ⠿ adhoc-1 Pulled   1.4s
>  ⠿ graphite Pulled   1.3s
>  ⠿ sparkmaster Pulled   1.3s
>  ⠿ hive-metastore-postgresql Pulled   1.3s
>  ⠿ presto-coordinator-1 Pulled   1.3s
>  ⠿ spark-worker-1 Pulled   1.4s
>  ⠿ hiveserver Pulled   1.3s
>  ⠿ hivemetastore Pulled   1.4s
>  ⠿ zookeeper Pulled   1.3s
>  ⠿ datanode1 Pulled   1.3s
> [+] Running 16/16
>  ⠿ Network compose_default              Created   0.0s
>  ⠿ Container hive-metastore-postgresql  Started   1.1s
>  ⠿ Container kafkabroker                Started   1.1s
>  ⠿ Container zookeeper                  Started

[hudi] branch master updated: [HUDI-2786] Docker demo on mac aarch64 (#6859)

2022-10-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 06d924137b [HUDI-2786] Docker demo on mac aarch64 (#6859)
06d924137b is described below

commit 06d924137bbf216864ee4fa09018b325c8b0a636
Author: Jon Vexler 
AuthorDate: Fri Oct 7 17:02:09 2022 -0400

[HUDI-2786] Docker demo on mac aarch64 (#6859)
---
 ...pose_hadoop284_hive233_spark244_mac_aarch64.yml | 259 +
 docker/setup_demo.sh   |  10 +-
 docker/stop_demo.sh|   7 +-
 3 files changed, 272 insertions(+), 4 deletions(-)

diff --git 
a/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml 
b/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml
new file mode 100644
index 00..857180cfbe
--- /dev/null
+++ b/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml
@@ -0,0 +1,259 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+version: "3.3"
+
+services:
+
+  namenode:
+image: apachehudi/hudi-hadoop_2.8.4-namenode:linux-arm64-0.10.1
+platform: linux/arm64
+hostname: namenode
+container_name: namenode
+environment:
+  - CLUSTER_NAME=hudi_hadoop284_hive232_spark244
+ports:
+  - "50070:50070"
+  - "8020:8020"
+  # JVM debugging port (will be mapped to a random port on host)
+  - "5005"
+env_file:
+  - ./hadoop.env
+healthcheck:
+  test: [ "CMD", "curl", "-f", "http://namenode:50070"; ]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+
+  datanode1:
+image: apachehudi/hudi-hadoop_2.8.4-datanode:linux-arm64-0.10.1
+platform: linux/arm64
+container_name: datanode1
+hostname: datanode1
+environment:
+  - CLUSTER_NAME=hudi_hadoop284_hive232_spark244
+env_file:
+  - ./hadoop.env
+ports:
+  - "50075:50075"
+  - "50010:50010"
+  # JVM debugging port (will be mapped to a random port on host)
+  - "5005"
+links:
+  - "namenode"
+  - "historyserver"
+healthcheck:
+  test: [ "CMD", "curl", "-f", "http://datanode1:50075"; ]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+depends_on:
+  - namenode
+
+  historyserver:
+image: apachehudi/hudi-hadoop_2.8.4-history:latest
+hostname: historyserver
+container_name: historyserver
+environment:
+  - CLUSTER_NAME=hudi_hadoop284_hive232_spark244
+depends_on:
+  - "namenode"
+links:
+  - "namenode"
+ports:
+  - "58188:8188"
+healthcheck:
+  test: [ "CMD", "curl", "-f", "http://historyserver:8188"; ]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+env_file:
+  - ./hadoop.env
+volumes:
+  - historyserver:/hadoop/yarn/timeline
+
+  hive-metastore-postgresql:
+image: menorah84/hive-metastore-postgresql:2.3.0
+platform: linux/arm64
+environment:
+  - POSTGRES_HOST_AUTH_METHOD=trust
+volumes:
+  - hive-metastore-postgresql:/var/lib/postgresql
+hostname: hive-metastore-postgresql
+container_name: hive-metastore-postgresql
+
+  hivemetastore:
+image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:linux-arm64-0.10.1
+platform: linux/arm64
+hostname: hivemetastore
+container_name: hivemetastore
+links:
+  - "hive-metastore-postgresql"
+  - "namenode"
+env_file:
+  - ./hadoop.env
+command: /opt/hive/bin/hive --service metastore
+environment:
+  SERVICE_PRECONDITION: "namenode:50070 hive-metastore-postgresql:5432"
+ports:
+  - "9083:9083"
+  # JVM debugging port (will be mapped to a random port on host)
+  - "5005"
+healthcheck:
+  test: [ "CMD", "nc", "-z", "hivemetastore", "9083" ]
+  interval: 30s
+  timeout: 10s
+  retries: 3
+depends_on:
+  - "hive-metastore-postgresql"
+  - "namenode"
+
+  hiveserver:
+image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:linux-arm64-0.10.1
+platform: linux/arm64
+hostname: hiveserver
+container_name: hiveserver
+env_file:
+  - ./hadoop.env
+environment:

[GitHub] [hudi] nsivabalan merged pull request #6859: [HUDI-2786] Docker demo on mac aarch64

2022-10-07 Thread GitBox


nsivabalan merged PR #6859:
URL: https://github.com/apache/hudi/pull/6859


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (a51181726c -> c5125d38b5)

2022-10-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from a51181726c [HUDI-4992] Fixing invalid min/max record key stats in 
Parquet metadata (#6883)
 add c5125d38b5 [HUDI-4972] Fixes to make unit tests work on m1 mac (#6751)

No new revisions were added by this update.

Summary of changes:
 hudi-examples/hudi-examples-java/pom.xml |  6 ++
 hudi-timeline-service/pom.xml|  5 +
 pom.xml  | 34 +++-
 3 files changed, 44 insertions(+), 1 deletion(-)



[GitHub] [hudi] nsivabalan merged pull request #6751: [HUDI-4972] Fixes to make unit tests work on m1

2022-10-07 Thread GitBox


nsivabalan merged PR #6751:
URL: https://github.com/apache/hudi/pull/6751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits

2022-10-07 Thread GitBox


hudi-bot commented on PR #6836:
URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272058386

   
   ## CI report:
   
   * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017)
 
   * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread GitBox


hudi-bot commented on PR #6806:
URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271997791

   
   ## CI report:
   
   * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12054)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271988433

   
   ## CI report:
   
   * 1ad6147e60eb5247411ad3965106891139c47148 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055)
 
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271977601

   
   ## CI report:
   
   * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053)
 
   * 1ad6147e60eb5247411ad3965106891139c47148 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055)
 
   * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271870496

   
   ## CI report:
   
   * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053)
 
   * 1ad6147e60eb5247411ad3965106891139c47148 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6885: [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool

2022-10-07 Thread GitBox


nsivabalan commented on PR #6885:
URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271866369

   @pramodbiligiri : there is some failure in GitHub Actions. Can you check on 
that please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271865581

   
   ## CI report:
   
   * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053)
 
   * 1ad6147e60eb5247411ad3965106891139c47148 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


hudi-bot commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271863767

   
   ## CI report:
   
   * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN
   * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN
   * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN
   * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12052)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4972) Some unit tests cannot run on mac m1

2022-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4972:
-
Labels: pull-request-available  (was: )

> Some unit tests cannot run on mac m1
> 
>
> Key: HUDI-4972
> URL: https://issues.apache.org/jira/browse/HUDI-4972
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> org.xerial.snappy is not compatible with M1 Macs before version 1.1.8.2. 
> Additionally, Spark is not compatible with M1 Macs before version 2.4.8, and 
> rocksdbjni is not compatible before version 6.29.4.1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #6751: [HUDI-4972] Fixes to make unit tests work on m1

2022-10-07 Thread GitBox


nsivabalan commented on code in PR #6751:
URL: https://github.com/apache/hudi/pull/6751#discussion_r990326965


##
packaging/hudi-utilities-bundle/pom.xml:
##
@@ -376,6 +376,12 @@
   <groupId>org.apache.parquet</groupId>
   <artifactId>parquet-avro</artifactId>
   <scope>compile</scope>
+  <exclusions>
+    <exclusion>
+      <groupId>org.xerial.snappy</groupId>
+      <artifactId>snappy-java</artifactId>

Review Comment:
   we test this and things are good.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread GitBox


hudi-bot commented on PR #6806:
URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271803780

   
   ## CI report:
   
   * 57f8b811946c7c013cbc31e9bd8db469f70cda2a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12036)
 
   * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12054)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271799001

   
   ## CI report:
   
   * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread GitBox


hudi-bot commented on PR #6806:
URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271798724

   
   ## CI report:
   
   * 57f8b811946c7c013cbc31e9bd8db469f70cda2a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12036)
 
   * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4995) Dependency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4995:

Priority: Minor  (was: Major)

> Dependency conflicts on apache http with other projects
> 
>
> Key: HUDI-4995
> URL: https://issues.apache.org/jira/browse/HUDI-4995
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Priority: Minor
> Fix For: 0.12.1
>
>
> Hudi imports org.apache.http, which can collide with other libs such as the 
> Elasticsearch client. This makes the spark-bundle create conflicts when 
> using both libs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4995) Dependency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nicolas paris updated HUDI-4995:

Fix Version/s: 0.12.1

> Dependency conflicts on apache http with other projects
> 
>
> Key: HUDI-4995
> URL: https://issues.apache.org/jira/browse/HUDI-4995
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: nicolas paris
>Priority: Major
> Fix For: 0.12.1
>
>
> Hudi imports org.apache.http, which can collide with other libs such as the 
> Elasticsearch client. This makes the spark-bundle create conflicts when 
> using both libs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions

2022-10-07 Thread GitBox


hudi-bot commented on PR #6888:
URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271793283

   
   ## CI report:
   
   * ed246a4e7ff10fe5c2ba9720823873441e1b5831 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271791942

   
   ## CI report:
   
   * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12051)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4995) Dependency conflicts on apache http with other projects

2022-10-07 Thread nicolas paris (Jira)
nicolas paris created HUDI-4995:
---

 Summary: Dependency conflicts on apache http with other projects
 Key: HUDI-4995
 URL: https://issues.apache.org/jira/browse/HUDI-4995
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: nicolas paris


Hudi imports org.apache.http, which can collide with other libs such as the 
Elasticsearch client. This makes the spark-bundle create conflicts when using 
both libs. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion

2022-10-07 Thread GitBox


alexeykudinkin commented on code in PR #6806:
URL: https://github.com/apache/hudi/pull/6806#discussion_r990227333


##
hudi-utilities/pom.xml:
##
@@ -85,7 +85,6 @@
 
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java-util</artifactId>
-  <scope>test</scope>

Review Comment:
   Yes, let's keep changes bounded -- shading is the right change but there's 
no reason to take it in this PR, since we're adding Proto deps to bundles



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4992) Spark Row-writing Bulk Insert produces incorrect Bloom Filter metadata

2022-10-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4992.

Resolution: Fixed

> Spark Row-writing Bulk Insert produces incorrect Bloom Filter metadata
> --
>
> Key: HUDI-4992
> URL: https://issues.apache.org/jira/browse/HUDI-4992
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Troubleshooting a duplicates issue w/ Abhishek Modi from Notion, we've found 
> that the min/max record key stats are currently being persisted incorrectly 
> into Parquet metadata, leading to duplicate records being produced in their 
> pipeline after the initial bulk-insert.
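
For anyone eyeballing these stats on a concrete base file, a rough sketch using the stock parquet-mr footer API; it assumes the stats live in the Parquet footer's key-value metadata (as described above), and the file path handling is purely illustrative:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterMetadataDump {
  public static void main(String[] args) throws Exception {
    // Path to a Hudi base (parquet) file, e.g. passed on the command line.
    Path file = new Path(args[0]);
    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
      // Dump all key-value footer metadata; the min/max record key entries
      // discussed above should show up here if they were written.
      reader.getFooter().getFileMetaData().getKeyValueMetaData()
          .forEach((k, v) -> System.out.println(k + " = " + v));
    }
  }
}
{code}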



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4472) Revisit schema handling in HoodieSparkSqlWriter

2022-10-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4472:
-
Reviewers: Ethan Guo, Raymond Xu, sivabalan narayanan

> Revisit schema handling in HoodieSparkSqlWriter
> ---
>
> Key: HUDI-4472
> URL: https://issues.apache.org/jira/browse/HUDI-4472
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> After many features aimed at bringing more and more sophisticated support for 
> schema evolution were layered in w/in HoodieSparkSqlWriter, it currently 
> requires careful attention to reconcile the many flows and make sure that the 
> original invariants still hold.
>  
> One example of the issue was discovered while addressing HUDI-4081 (which was 
> duct-typed in [#6213|https://github.com/apache/hudi/pull/6213/files#] to 
> avoid substantial changes before the release)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xushiyan commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0

2022-10-07 Thread GitBox


xushiyan commented on issue #6692:
URL: https://github.com/apache/hudi/issues/6692#issuecomment-1271738970

   I think the issue originated from the hadoop-mr-bundle's shaded 
`org.apache.avro.*` package. @eshu can you list out all the jars you add to the 
classpath? Specifically the Hudi jars.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4982) Make bundle combination testing covered in CI

2022-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4982:
-
Labels: pull-request-available  (was: )

> Make bundle combination testing covered in CI
> -
>
> Key: HUDI-4982
> URL: https://issues.apache.org/jira/browse/HUDI-4982
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Raymond Xu
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #6888: [HUDI-4982] [DO NOT MERGE] add spark bundle tests to github actions

2022-10-07 Thread GitBox


jonvex opened a new pull request, #6888:
URL: https://github.com/apache/hudi/pull/6888

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4975) datahub sync bundle causes class loading issue

2022-10-07 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614121#comment-17614121
 ] 

Raymond Xu commented on HUDI-4975:
--

Root-caused the issue:

When using the datahub-sync-bundle built with the spark3.3 profile, it expects 
to work with 

{code}
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
{code}

which does not exist in parquet 1.10.1, which is used by spark 3.1

This means datahub-sync-bundle (and possibly other sync bundles) are not fully 
decoupled from spark profiles. This can be mitigated by putting spark-bundle 
first in the classpath, but we should eliminate the root issue.
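
A throwaway check along these lines can confirm which parquet-column actually ends up on the classpath; the class name is taken from the stack trace quoted below, and everything else (the class and wording) is purely illustrative:

{code:java}
public class ParquetClasspathCheck {
  public static void main(String[] args) {
    String cls = "org.apache.parquet.schema.LogicalTypeAnnotation";
    try {
      Class<?> c = Class.forName(cls);
      java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
      // Prints the jar that actually provided the class, i.e. which bundle won.
      System.out.println(cls + " loaded from "
          + (src != null ? src.getLocation() : "the bootstrap classloader"));
    } catch (ClassNotFoundException e) {
      // Expected when an older parquet-column (e.g. 1.10.1 from Spark 3.1) wins.
      System.out.println(cls + " not found on the classpath");
    }
  }
}
{code}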

> datahub sync bundle causes class loading issue
> --
>
> Key: HUDI-4975
> URL: https://issues.apache.org/jira/browse/HUDI-4975
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Critical
> Fix For: 0.12.2
>
>
> run utilities-slim.jar as the main jar for deltastreamer
> set --jars 
> /tmp/hudi-datahub-sync-bundle-0.12.1-rc1.jar,/tmp/hudi-spark3.1-bundle_2.12-0.12.1-rc1.jar
> put datahub sync bundle before spark bundle resulted in class loader issue. 
> works fine if spark bundle goes first
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation
>   at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.newParquetFileWriter(HoodieFileWriterFactory.java:78)
>   at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.newParquetFileWriter(HoodieFileWriterFactory.java:70)
>   at 
> org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:54)
>   at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:104)
>   at 
> org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:76)
>   at 
> org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:46)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:83)
>   at 
> org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40)
>   at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>   at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:135)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.parquet.schema.LogicalTypeAnnotation
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   ... 14 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6886: [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities

2022-10-07 Thread GitBox


hudi-bot commented on PR #6886:
URL: https://github.com/apache/hudi/pull/6886#issuecomment-1271727870

   
   ## CI report:
   
   * 832c98a95c9482d24389ee2f2052893097bfdeda Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12049)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


zhangyue19921010 commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271721244

   Hi @alexeykudinkin and @nsivabalan, really appreciate your review and 
comments! Also sorry for taking a few days to address these comments. 
   
   Please take a look at your convenience :)
   
   If there are any omissions or other things that need to be modified, please 
feel free to let me know! Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6885: [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool

2022-10-07 Thread GitBox


hudi-bot commented on PR #6885:
URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271715586

   
   ## CI report:
   
   * 0684a683ea55e513e8c896e5748270a9e5953d44 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12048)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


hudi-bot commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271713313

   
   ## CI report:
   
   * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN
   * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN
   * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN
   * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12052)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


zhangyue19921010 commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271688530

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271623904

   
   ## CI report:
   
   * 730f5b91c206267dc89e471732f56679424c7391 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050)
 
   * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12051)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271616397

   
   ## CI report:
   
   * e391adad3fa33145e8814160107d2afbcb450597 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045)
 
   * 730f5b91c206267dc89e471732f56679424c7391 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050)
 
   * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


hudi-bot commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271613867

   
   ## CI report:
   
   * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN
   * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN
   * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN
   * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4994) DatahubSyncTool does not correctly re-ingest soft-deleted entities

2022-10-07 Thread Pramod Biligiri (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-4994:
--
Description: 
Datahub has a notion of soft-deletes (the entity still exists in the database 
with a status=removed:true). Such entities could get re-ingested with new 
properties at a later time, such that the older one gets overwritten. The 
current implementation in DatahubSyncTool does not handle this scenario. It 
fails to update the status flag to removed:false during ingest, which means the 
entity won't surface in the Datahub UI at all.

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
[https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

  was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of 
their RestEmiiter client, it can be assumed that the entity is no longer 
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the 
entity to "\{'removed':false}". This will handle the situation where the entity 
may have been (soft) deleted in the past. The addition of this "removed:false" 
for "status" aspect has no impact on newly created entities, or hard-deleted 
entities (of which no trace remains anyway).

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default

Summary: DatahubSyncTool does not correctly re-ingest soft-deleted 
entities  (was: DatahubSyncTool should set "removed" status of an entity to 
false when updating it)

> DatahubSyncTool does not correctly re-ingest soft-deleted entities
> --
>
> Key: HUDI-4994
> URL: https://issues.apache.org/jira/browse/HUDI-4994
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Pramod Biligiri
>Priority: Major
>  Labels: pull-request-available
>
> Datahub has a notion of soft-deletes (the entity still exists in the database 
> with a status=removed:true). Such entities could get re-ingested with new 
> properties at a later time, such that the older one gets overwritten. The 
> current implementation in DatahubSyncTool does not handle this scenario. It 
> fails to update the status flag to removed:false during ingest, which means 
> the entity won't surface in the Datahub UI at all.
> Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
> [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]
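
To make the fix concrete: undoing a soft delete amounts to emitting the entity's 
`status` aspect with `removed=false` alongside the regular metadata aspects. 
Below is a minimal sketch using the Datahub Python REST emitter (acryl-datahub); 
the URN, GMS endpoint and property values are placeholders, and this is not the 
DatahubSyncTool code path, which goes through the Java client.

```python
# Minimal sketch, assuming the acryl-datahub package and a reachable GMS endpoint.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    DatasetPropertiesClass,
    StatusClass,
)

GMS_SERVER = "http://localhost:8080"  # placeholder GMS endpoint
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hudi,my_db.my_table,PROD)"  # placeholder URN

emitter = DatahubRestEmitter(gms_server=GMS_SERVER)

# Regular upsert of a metadata aspect (dataset properties, as an example).
emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=DATASET_URN,
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description="Hudi table synced to Datahub"),
    )
)

# The missing piece described above: flip the soft-delete flag back explicitly,
# otherwise a previously soft-deleted entity keeps status.removed=true and stays
# hidden in the Datahub UI even though the upsert above succeeded.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=DATASET_URN,
        aspectName="status",
        aspect=StatusClass(removed=False),
    )
)
```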



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271542716

   
   ## CI report:
   
   * e391adad3fa33145e8814160107d2afbcb450597 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045)
 
   * 730f5b91c206267dc89e471732f56679424c7391 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271536505

   
   ## CI report:
   
   * e391adad3fa33145e8814160107d2afbcb450597 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045)
 
   * 730f5b91c206267dc89e471732f56679424c7391 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-10-07 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271530268

   
   ## CI report:
   
   * e391adad3fa33145e8814160107d2afbcb450597 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan opened a new pull request, #6887: [MINOR][DOCS] Fix docker_demo.md for 0.12.0

2022-10-07 Thread GitBox


xushiyan opened a new pull request, #6887:
URL: https://github.com/apache/hudi/pull/6887

   ### Change Logs
   
   Update 0.12.0 docs.
   
   ### Impact
   
   **Risk level: none**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6875: [SUPPORT] - [Docker Demo] - Partition key parts [dt] does not match with partition values [2018, 08, 31]

2022-10-07 Thread GitBox


xushiyan commented on issue #6875:
URL: https://github.com/apache/hudi/issues/6875#issuecomment-1271502980

   @pavimotorq I think the website was not updated somehow. In the latest docs 
https://hudi.apache.org/docs/next/docker_demo we have updated the commands to 
include `--partition-value-extractor 
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`. Please follow 
that guide. cc @bhasudha we may need to backfill this change for the 0.12.0 
version of the docs.
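
For readers hitting the same error when syncing through the Spark datasource 
rather than the standalone sync script, the equivalent knob is the 
`hoodie.datasource.hive_sync.partition_extractor_class` option. A hedged 
pyspark sketch, with placeholder table/partition names and jdbc url (`df` and 
`basePath` are assumed to exist as in the quickstart):

```python
# Hedged sketch: Hive sync via Spark datasource options with the slash-encoded
# day partition value extractor. All names and the jdbc url are placeholders.
hudi_hive_sync_options = {
    "hoodie.table.name": "stock_ticks_cow",
    "hoodie.datasource.write.recordkey.field": "key",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "stock_ticks_cow",
    "hoodie.datasource.hive_sync.partition_fields": "dt",
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://hiveserver:10000",
    # The point of the docs fix: turn a yyyy/mm/dd folder layout into a single
    # partition value instead of three separate parts.
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
}

df.write.format("hudi") \
    .options(**hudi_hive_sync_options) \
    .mode("append") \
    .save(basePath)
```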


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #6881: Processing time is increased with hudi metadata enable

2022-10-07 Thread GitBox


xushiyan commented on issue #6881:
URL: https://github.com/apache/hudi/issues/6881#issuecomment-1271484448

   @koochiswathiTR is the processing time of 15 min for every commit after 
enabling the metadata table? Please also provide the Hudi/Spark versions, 
workload info such as how many records per commit and how many of them are 
updates, and the environment info as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6886: [HUDI-4994] Adding undo of soft-delete to upsert code flow

2022-10-07 Thread GitBox


hudi-bot commented on PR #6886:
URL: https://github.com/apache/hudi/pull/6886#issuecomment-1271468381

   
   ## CI report:
   
   * 832c98a95c9482d24389ee2f2052893097bfdeda Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12049)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6885: [HUDI-4993] Making DataPlatform name and Dataset env configurable

2022-10-07 Thread GitBox


hudi-bot commented on PR #6885:
URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271468346

   
   ## CI report:
   
   * 0684a683ea55e513e8c896e5748270a9e5953d44 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12048)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency

2022-10-07 Thread GitBox


hudi-bot commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271466614

   
   ## CI report:
   
   * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN
   * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN
   * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN
   * 4587303118918c5e56ecb10732d9fcba43a90ee7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12044)
 
   * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [MINOR][DOCS] remove validation code from quickstart examples (#6879)

2022-10-07 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e013d654da [MINOR][DOCS] remove validation code from quickstart 
examples (#6879)
e013d654da is described below

commit e013d654da67fedca5a24026ab93e8fc34f5b3a8
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Fri Oct 7 19:22:42 2022 +0800

[MINOR][DOCS] remove validation code from quickstart examples (#6879)
---
 website/docs/quick-start-guide.md | 40 ---
 1 file changed, 40 deletions(-)

diff --git a/website/docs/quick-start-guide.md 
b/website/docs/quick-start-guide.md
index 4beb9577c5..ed7bb29698 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -189,7 +189,6 @@ import org.apache.hudi.common.model.HoodieRecord
 val tableName = "hudi_trips_cow"
 val basePath = "file:///tmp/hudi_trips_cow"
 val dataGen = new DataGenerator
-val snapshotQuery = "SELECT begin_lat, begin_lon, driver, end_lat, end_lon, 
fare, partitionpath, rider, ts, uuid FROM hudi_ro_table"
 ```
 
 
@@ -200,7 +199,6 @@ val snapshotQuery = "SELECT begin_lat, begin_lon, driver, 
end_lat, end_lon, fare
 tableName = "hudi_trips_cow"
 basePath = "file:///tmp/hudi_trips_cow"
 dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
-snapshotQuery = "SELECT begin_lat, begin_lon, driver, end_lat, end_lon, fare, 
partitionpath, rider, ts, uuid FROM hudi_ro_table"
 ```
 
 
@@ -429,9 +427,6 @@ df.write.format("hudi").
   option(TABLE_NAME, tableName).
   mode(Overwrite).
   save(basePath)
-  
-// validations
-assert(df.except(spark.sql(snapshotQuery)).count() == 0)
 ```
 :::info
 `mode(Overwrite)` overwrites and recreates the table if it already exists.
@@ -468,9 +463,6 @@ df.write.format("hudi"). \
 options(**hudi_options). \
 mode("overwrite"). \
 save(basePath)
-
-# validations
-assert spark.sql(snapshotQuery).exceptAll(df).count() == 0
 ```
 :::info
 `mode(Overwrite)` overwrites and recreates the table if it already exists.
@@ -713,7 +705,6 @@ values={[
 
 ```scala
 // spark-shell
-val snapBeforeUpdate = spark.sql(snapshotQuery)
 val updates = convertToStringList(dataGen.generateUpdates(10))
 val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
 df.write.format("hudi").
@@ -724,10 +715,6 @@ df.write.format("hudi").
   option(TABLE_NAME, tableName).
   mode(Append).
   save(basePath)
-  
-// validations
-assert(spark.sql(snapshotQuery).intersect(df).count() == df.count())
-assert(spark.sql(snapshotQuery).except(df).except(snapBeforeUpdate).count() == 
0)
 ```
 :::note
 Notice that the save mode is now `Append`. In general, always use append mode 
unless you are trying to create the table for the first time.
@@ -816,17 +803,12 @@ when not matched then
 
 ```python
 # pyspark
-snapshotBeforeUpdate = spark.sql(snapshotQuery)
 updates = 
sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
 df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
 df.write.format("hudi"). \
   options(**hudi_options). \
   mode("append"). \
   save(basePath)
-  
-# validations
-assert spark.sql(snapshotQuery).intersect(df).count() == df.count()
-assert 
spark.sql(snapshotQuery).exceptAll(snapshotBeforeUpdate).exceptAll(df).count() 
== 0
 ```
 :::note
 Notice that the save mode is now `Append`. In general, always use append mode 
unless you are trying to create the table for the first time.
@@ -1122,7 +1104,6 @@ Delete records for the HoodieKeys passed in.
 
 ```scala
 // spark-shell
-val snapshotBeforeDelete = spark.sql(snapshotQuery)
 // fetch total records count
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 // fetch two records to be deleted
@@ -1151,10 +1132,6 @@ val roAfterDeleteViewDF = spark.
 roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
 // fetch should return (total - 2) records
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
-
-// validations
-assert(spark.sql("select uuid, partitionpath, ts from 
hudi_trips_snapshot").intersect(hardDeleteDf).count() == 0)
-assert(snapshotBeforeDelete.except(spark.sql("select uuid, partitionpath, ts 
from hudi_trips_snapshot")).except(snapshotBeforeDelete).count() == 0)
 ```
 :::note
 Only `Append` mode is supported for delete operation.
@@ -1182,7 +1159,6 @@ Delete records for the HoodieKeys passed in.
 
 ```python
 # pyspark
-snapshotBeforeDelete = spark.sql(snapshotQuery)
 # fetch total records count
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 # fetch two records to be deleted
@@ -1216,10 +1192,6 @@ roAfterDeleteViewDF = spark. \
 roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
 # fetch should return (total - 2) records
 spark.sql("select uuid, partitionpath from hudi_
