[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex
YuweiXiao commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r990601721

## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ##
@@ -179,15 +197,125 @@ public void close() throws Exception {
   }

   protected List getAllQueryPartitionPaths() {
+    if (cachedAllPartitionPaths != null) {
+      return cachedAllPartitionPaths;
+    }
+
+    loadAllQueryPartitionPaths();
+    return cachedAllPartitionPaths;
+  }
+
+  private void loadAllQueryPartitionPaths() {
     List queryRelativePartitionPaths = queryPaths.stream()
         .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
         .collect(Collectors.toList());
-    // Load all the partition path from the basePath, and filter by the query partition path.
-    // TODO load files from the queryRelativePartitionPaths directly.
-    List matchedPartitionPaths = getAllPartitionPathsUnchecked()
-        .stream()
-        .filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
+    this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths);
+
+    // If the partition value contains InternalRow.empty, we query it as a non-partitioned table.
+    this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0);
+  }
+
+  protected Map> getAllInputFileSlices() {
+    if (!isAllInputFileSlicesCached) {

Review Comment:
Yeah, good point. 1) generalize to batch get 2) load only remaining partitions

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
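The two points in the review comment above (a batch get, and loading only the partitions that are not cached yet) could look roughly like the sketch below. This is only an illustration under assumptions: `LazyPartitionCache` and its `batchLoader` function are hypothetical names, not classes from Hudi or from this PR, and the generic parameters stand in for the partition-path and file-slice types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Hudi): caches per-partition listings lazily. */
final class LazyPartitionCache<P, S> {
  private final Map<P, List<S>> cache = new HashMap<>();
  // Assumed batch loader: given partitions that are not cached yet, returns their slices.
  private final Function<Collection<P>, Map<P, List<S>>> batchLoader;

  LazyPartitionCache(Function<Collection<P>, Map<P, List<S>>> batchLoader) {
    this.batchLoader = batchLoader;
  }

  /** Batch get: only the requested partitions that are still missing are loaded. */
  Map<P, List<S>> get(Collection<P> partitions) {
    List<P> missing = new ArrayList<>();
    for (P p : partitions) {
      if (!cache.containsKey(p) && !missing.contains(p)) {
        missing.add(p);
      }
    }
    if (!missing.isEmpty()) {
      Map<P, List<S>> loaded = batchLoader.apply(missing);
      for (P p : missing) {
        // Cache an empty list when the loader returned nothing, so the
        // partition is not listed again on the next call.
        cache.put(p, loaded.getOrDefault(p, new ArrayList<>()));
      }
    }
    Map<P, List<S>> result = new LinkedHashMap<>();
    for (P p : partitions) {
      result.put(p, cache.get(p));
    }
    return result;
  }
}
```

With this shape, a "get all" call is just a batch get over every known partition, so partitions already fetched by an earlier pruned query are not listed twice.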
[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex
YuweiXiao commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r990601573

## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ##
@@ -179,15 +197,125 @@ public void close() throws Exception {
   }

   protected List getAllQueryPartitionPaths() {
+    if (cachedAllPartitionPaths != null) {
+      return cachedAllPartitionPaths;
+    }
+
+    loadAllQueryPartitionPaths();

Review Comment:
Yes, you are right. I will have it inlined.
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272242144

## CI report:

* 288d166c49602a4593b1e97763a467811903737d UNKNOWN
* f8732300afaf355296ca13fe7f2d3e9a131315d6 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12063)
* 18ef7b44488dff256728b2bba024b4a4d00aebe9 UNKNOWN

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272241019

## CI report:

* 288d166c49602a4593b1e97763a467811903737d UNKNOWN
* 1d98224805b75fc0c9c8ec54948870e96c4b54e7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12043)
* f8732300afaf355296ca13fe7f2d3e9a131315d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12063)
* 18ef7b44488dff256728b2bba024b4a4d00aebe9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358:
URL: https://github.com/apache/hudi/pull/6358#issuecomment-1272240117

## CI report:

* 288d166c49602a4593b1e97763a467811903737d UNKNOWN
* 1d98224805b75fc0c9c8ec54948870e96c4b54e7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12043)
* f8732300afaf355296ca13fe7f2d3e9a131315d6 UNKNOWN
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r990595189

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ##
@@ -185,7 +185,7 @@ public class HoodieWriteConfig extends HoodieConfig {
   public static final ConfigProperty AVRO_SCHEMA_VALIDATE_ENABLE = ConfigProperty
       .key("hoodie.avro.schema.validate")
-      .defaultValue("false")
+      .defaultValue("true")

Review Comment:
This is flipped to be enabled by default to make sure proper schema validation is run for every operation on the table.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java: ##
@@ -81,20 +79,7 @@
   public static IgnoreRecord IGNORE_RECORD = new IgnoreRecord();

   /**
-   * The specified schema of the table. ("specified" denotes that this is configured by the client,
-   * as opposed to being implicitly fetched out of the commit metadata)
-   */
-  protected final Schema tableSchema;
-  protected final Schema tableSchemaWithMetaFields;

Review Comment:
These fields were misused and are redundant, hence deleted.

## hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaCompatibility.java: ##
@@ -0,0 +1,941 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.avro;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.Schema.Type;
+import org.apache.hudi.common.util.Either;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.Method;
+import java.util.ArrayDeque;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Deque;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.stream.Collectors;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+
+/**
+ * Evaluate the compatibility between a reader schema and a writer schema. A
+ * reader and a writer schema are declared compatible if all datum instances of
+ * the writer schema can be successfully decoded using the specified reader
+ * schema.
+ *
+ * NOTE: PLEASE READ CAREFULLY BEFORE CHANGING
+ *
+ * This code is borrowed from Avro 1.10, with the following modifications:
+ *
+ * Compatibility checks ignore schema name, unless schema is held inside
+ * a union
+ *
+ *
+ */
+public class AvroSchemaCompatibility {

Review Comment:
Context: Avro requires at all times that schemas' names match in order for them to be counted as compatible. Provided that only Avro bears the names on the schemas themselves (Spark does not, for example), this makes some schemas converted from Spark's [[StructType]] incompatible with Avro.

This code is mostly borrowed as is from Avro 1.10, with the following critical adjustment: schema names are now only checked in the following 2 cases:

- In case it's a top-level schema
- In case the schema is enclosed in a union (in which case its name might be used for reverse lookup)

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseMergeHelper.java: ##
@@ -18,91 +18,47 @@
 package org.apache.hudi.table.action.commit;

+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.client.utils.MergingIterator;
-import org.apache.hudi.common.model.HoodieBaseFile;
-import org.apache.hudi.common.model.HoodieRecordPayload;
 import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
-import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.io.HoodieMergeHandle;
 import org.apache.hudi.io.storage.HoodieFileReader;
 import org.apache.hudi.io.storage.HoodieFileReaderFactory;
 import org.apache.hudi.table.HoodieTable;
-import org.apache.avro.ge
[GitHub] [hudi] xushiyan commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0
xushiyan commented on issue #6692:
URL: https://github.com/apache/hudi/issues/6692#issuecomment-1272232597

> @xushiyan I use the fat jar, but I do not know what is added to the classpath by AWS in Glue 3.

@eshu by fat jar, do you mean the bundle jar? Is it the Hudi Spark 3.1 bundle, and is it the only bundle you used? I need to reproduce this with the same jars you used, so please provide info on which jars you added to your Glue job. Thanks.
[GitHub] [hudi] eshu commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0
eshu commented on issue #6692:
URL: https://github.com/apache/hudi/issues/6692#issuecomment-1272232164

@xushiyan I use the fat jar, but I do not know what is added to the classpath by AWS in Glue 3.
[GitHub] [hudi] danny0405 commented on pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on PR #6818:
URL: https://github.com/apache/hudi/pull/6818#issuecomment-1272218777

> This PR will support flushing of CDC data blocks and rollover of CDC log files. This feature needs to update the write stat for CDC, which is the key point that needs to be discussed.
>
> There may be two solutions:
>
> 1. like this PR: both `cdcPaths` and `cdcWriteBytes` are of the `list` data type.
> 2. use a map, like:
>
> ```
> cdcWriteStats: {
>   "cdclogfile1": cdclogFile1Size,
>   "cdclogfile1": cdclogFile1Size
> }
> ```
>
> cc @xushiyan @alexeykudinkin @danny0405 WDYH?

What is the file size used for?
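For readers comparing the two shapes proposed in the quoted comment, here is a minimal sketch. Everything in it is illustrative: `CdcWriteStatShapes`, `fromParallelLists`, and `totalBytes` are hypothetical names, not part of `HoodieWriteStat` or of this PR. It shows that the parallel-list form carries an implicit "same length, same order" invariant, while the map form keeps each path and its size together.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; these names are illustrative and not Hudi's HoodieWriteStat API. */
final class CdcWriteStatShapes {

  /** Option 1: two parallel lists; entry i of both lists describes the same log file. */
  static Map<String, Long> fromParallelLists(List<String> cdcPaths, List<Long> cdcWriteBytes) {
    if (cdcPaths.size() != cdcWriteBytes.size()) {
      throw new IllegalArgumentException("cdcPaths and cdcWriteBytes must have the same length");
    }
    Map<String, Long> stats = new LinkedHashMap<>();
    for (int i = 0; i < cdcPaths.size(); i++) {
      stats.put(cdcPaths.get(i), cdcWriteBytes.get(i));
    }
    return stats;
  }

  /** Option 2: a single map from log file path to bytes written. */
  static long totalBytes(Map<String, Long> cdcWriteStats) {
    long total = 0L;
    for (long size : cdcWriteStats.values()) {
      total += size;
    }
    return total;
  }
}
```

Either form can answer "how many CDC bytes did this write produce"; the map simply makes it impossible for the paths and sizes to drift out of step.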
[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990585049

## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieCDCLogRecordIterator.java: ##
@@ -27,50 +27,94 @@
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;

 import java.io.IOException;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicInteger;

 public class HoodieCDCLogRecordIterator implements ClosableIterator {

-  private final HoodieLogFile cdcLogFile;
+  private final FileSystem fs;

-  private final HoodieLogFormat.Reader reader;
+  private final Schema cdcSchema;
+
+  private final Iterator cdcLogFileIter;
+
+  private HoodieLogFormat.Reader reader;
+
+  /**
+   * Due to the hasNext of {@link HoodieLogFormat.Reader} is not idempotent,
+   * Here guarantee idempotent by `hasNextCall` and `nextCall`.
+   */
+  private final AtomicInteger hasNextCall = new AtomicInteger(0);
+  private final AtomicInteger nextCall = new AtomicInteger(0);

   private ClosableIterator itr;

-  public HoodieCDCLogRecordIterator(
-      FileSystem fs,
-      Path cdcLogPath,
-      Schema cdcSchema) throws IOException {
-    this.cdcLogFile = new HoodieLogFile(fs.getFileStatus(cdcLogPath));
-    this.reader = new HoodieLogFileReader(fs, cdcLogFile, cdcSchema,
-        HoodieLogFileReader.DEFAULT_BUFFER_SIZE, false);
+  public HoodieCDCLogRecordIterator(FileSystem fs, HoodieLogFile[] cdcLogFiles, Schema cdcSchema) {
+    this.fs = fs;
+    this.cdcSchema = cdcSchema;
+    this.cdcLogFileIter = Arrays.stream(cdcLogFiles).iterator();
   }

Review Comment:
Is there a defined ordering for these files?
[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584901

## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieCDCLogRecordIterator.java: ##
@@ -27,50 +27,94 @@
 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;

 import java.io.IOException;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicInteger;

 public class HoodieCDCLogRecordIterator implements ClosableIterator {

-  private final HoodieLogFile cdcLogFile;
+  private final FileSystem fs;

-  private final HoodieLogFormat.Reader reader;
+  private final Schema cdcSchema;
+
+  private final Iterator cdcLogFileIter;
+
+  private HoodieLogFormat.Reader reader;
+
+  /**
+   * Due to the hasNext of {@link HoodieLogFormat.Reader} is not idempotent,
+   * Here guarantee idempotent by `hasNextCall` and `nextCall`.
+   */
+  private final AtomicInteger hasNextCall = new AtomicInteger(0);
+  private final AtomicInteger nextCall = new AtomicInteger(0);

Review Comment:
We can avoid these two variables by a `currentRecord` reference from the current iterator.
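One way to read the suggestion above is the usual look-ahead pattern: fetch the next record once, hold it in a `currentRecord` field, and let repeated `hasNext()` calls return early while that field is set. The sketch below is an assumption about what that could look like, not the code that ended up in the PR; `ChainedLookaheadIterator` is a hypothetical class, plain `java.util.Iterator` sources stand in for the per-file log readers, and records are assumed to be non-null.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch (not the PR's code): chains several sources behind one iterator. */
final class ChainedLookaheadIterator<T> implements Iterator<T> {
  private final Iterator<Iterator<T>> sources;
  private Iterator<T> current;
  // Non-null means a record was fetched but not handed out yet (records assumed non-null).
  private T currentRecord;

  ChainedLookaheadIterator(List<Iterator<T>> sources) {
    this.sources = sources.iterator();
  }

  @Override
  public boolean hasNext() {
    if (currentRecord != null) {
      return true; // repeated calls do not touch the underlying source again
    }
    while (true) {
      if (current == null) {
        if (!sources.hasNext()) {
          return false;
        }
        current = sources.next();
      }
      if (current.hasNext()) {
        currentRecord = current.next();
        return true;
      }
      current = null; // exhausted; never probe this source again
    }
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T record = currentRecord;
    currentRecord = null;
    return record;
  }
}
```

Each underlying `hasNext()` is followed immediately by `next()`, so a source whose `hasNext()` is not idempotent is probed exactly once per record, and no call counters are needed.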
[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584548

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java: ##
@@ -89,9 +94,19 @@ protected void writeInsertRecord(HoodieRecord hoodieRecord, Option close() {
     List writeStatuses = super.close();
     // if there are cdc data written, set the CDC-related information.
-    Option cdcResult =
-        HoodieCDCLogger.writeCDCDataIfNeeded(cdcLogger, recordsWritten, insertRecordsWritten);
-    HoodieCDCLogger.setCDCStatIfNeeded(writeStatuses.get(0).getStat(), cdcResult, partitionPath, fs);
+
+    if (cdcLogger == null || recordsWritten == 0L || (recordsWritten == insertRecordsWritten)) {
+      // the following cases where we do not need to write out the cdc file:

Review Comment:
The if condition is not suitable for Flink; we may need some changes for the Flink CDC handles.
[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990584508

## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieWriteStat.java: ##
@@ -254,12 +256,12 @@ public String getPath() {
   }

   @Nullable
-  public String getCdcPath() {
-    return cdcPath;
+  public List getCdcPaths() {
+    return cdcPaths;
   }

-  public void setCdcPath(String cdcPath) {
-    this.cdcPath = cdcPath;
+  public void setCdcPath(List cdcPaths) {
+    this.cdcPaths = cdcPaths;

Review Comment:
setCdcPath -> setCdcPaths
[GitHub] [hudi] danny0405 commented on a diff in pull request #6818: [HUDI-4948] Improve CDC Write
danny0405 commented on code in PR #6818:
URL: https://github.com/apache/hudi/pull/6818#discussion_r990583580

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java: ##
@@ -73,35 +80,56 @@ public class HoodieCDCLogger implements Closeable {

   private final Schema cdcSchema;

-  private final String cdcSchemaString;
-
   // the cdc data
   private final Map cdcData;

+  private final Map cdcDataBlockHeader;
+
   // the cdc record transformer
   private final CDCTransformer transformer;

+  // Max block size to limit to for a log block
+  private final int maxBlockSize;
+
+  // Average cdc record size. This size is updated at the end of every log block flushed to disk
+  private long averageCDCRecordSize = 0;
+
+  // Number of records that must be written to meet the max block size for a log block
+  private AtomicInteger numOfCDCRecordInMemory = new AtomicInteger();
+

Review Comment:
`numOfCDCRecordInMemory` -> `numOfCDCRecordsInMemory`
[GitHub] [hudi] danny0405 commented on a diff in pull request #6845: [HUDI-4945] Support to trigger the clean in the flink batch mode.
danny0405 commented on code in PR #6845:
URL: https://github.com/apache/hudi/pull/6845#discussion_r990583385

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSink.java: ##
@@ -96,9 +96,10 @@ public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
       pipeline = Pipelines.hoodieStreamWrite(conf, hoodieRecordDataStream);
       // compaction
       if (OptionsResolver.needsAsyncCompaction(conf)) {
-        // use synchronous compaction for bounded source.
+        // use synchronous compaction and clean for bounded source.
         if (context.isBounded()) {
           conf.setBoolean(FlinkOptions.COMPACTION_ASYNC_ENABLED, false);
+          conf.setBoolean(FlinkOptions.CLEAN_ASYNC_ENABLED, false);
         }

Review Comment:
Yeah, thanks for the explanation. Can we try to add a test case here?
[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config
danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r990583099

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java: ##
@@ -279,8 +279,9 @@ private FlinkOptions() {
       .defaultValue(false)// default read as batch
       .withDescription("Whether to skip compaction instants for streaming read,\n"
           + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+          + "1) `hoodie.compaction.preserve.commit.metadata` is set to `false` and you are definitely sure that the "
+          + "consumer reads faster than any compaction instants, usually with delta time compaction strategy that is "

Review Comment:
Thanks for the enhancement. We need the compaction to preserve the commit time metadata field, which is true by default.
[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling
hudi-bot commented on PR #6889:
URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272214966

## CI report:

* 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12062)
[jira] [Resolved] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
[ https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron resolved HUDI-4949.
--

> Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Yann Byron
> Assignee: Yann Byron
> Priority: Major
> Labels: pull-request-available

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
[ https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron updated HUDI-4949:
-
Fix Version/s: 0.13.0

> Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Yann Byron
> Assignee: Yann Byron
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.13.0
[jira] [Assigned] (HUDI-4949) Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
[ https://issues.apache.org/jira/browse/HUDI-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron reassigned HUDI-4949:

Assignee: Yann Byron

> Optimize cdc read to avoid problems that caused by reusing buffer underlying the Row
>
> Key: HUDI-4949
> URL: https://issues.apache.org/jira/browse/HUDI-4949
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Yann Byron
> Assignee: Yann Byron
> Priority: Major
> Labels: pull-request-available
[jira] [Closed] (HUDI-4915) Spark Avro SerDe returns wrong result upon multiple calls
[ https://issues.apache.org/jira/browse/HUDI-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yann Byron closed HUDI-4915.

Resolution: Won't Fix

> Spark Avro SerDe returns wrong result upon multiple calls
>
> Key: HUDI-4915
> URL: https://issues.apache.org/jira/browse/HUDI-4915
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark
> Reporter: Yann Byron
> Assignee: Yann Byron
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.13.0
>
> Currently, Spark Avro serializer/deserializer has a bug that it will return
> the same object when we call this method twice continuously. For example:
>
> val row1: InternalRow = ...
> val row2: InternalRow = ... // record2 is different with record1
>
> val serializeredRecord1 = serialize(row1)
> val serializeredRecord2 = serialize(row2)
> serializeredRecord1.equals(serializeredRecord2)
>
> That is because we use the `val` to declare the serializer/deserializer
> methods, so the latter's result will cover the previous one.
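The symptom described in the issue above (the second result "covering" the first) is what one would expect whenever a serializer hands back the same mutable object on every call. The sketch below only illustrates that general aliasing effect and one defensive workaround; `ReusingSerializer` is a hypothetical class written for this example and is not Spark's or Hudi's Avro serializer, and the sketch makes no claim about the actual root cause in Spark.

```java
/** Hypothetical serializer that reuses one mutable output object across calls. */
final class ReusingSerializer {
  private final StringBuilder reused = new StringBuilder();

  /** Returns the shared builder; the caller sees its content change on the next call. */
  CharSequence serialize(String row) {
    reused.setLength(0);
    reused.append("record:").append(row);
    return reused;
  }

  /** Defensive variant: copies the result so later calls cannot overwrite it. */
  String serializeCopy(String row) {
    return serialize(row).toString();
  }
}
```

Holding on to two results of `serialize` gives two references to the same object, so both appear to contain the second row; copying each result before the next call (as `serializeCopy` does) keeps them distinct at the cost of an extra allocation.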
[jira] [Assigned] (HUDI-4857) Replace DataFrame with HoodieData in Spark side
[ https://issues.apache.org/jira/browse/HUDI-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An reassigned HUDI-4857:

Assignee: (was: Hui An)

> Replace DataFrame with HoodieData in Spark side
>
> Key: HUDI-4857
> URL: https://issues.apache.org/jira/browse/HUDI-4857
> Project: Apache Hudi
> Issue Type: Improvement
> Components: code-quality, spark
> Reporter: Hui An
> Priority: Major
[GitHub] [hudi] YannByron opened a new pull request, #6891: [MINOR] update committer list
YannByron opened a new pull request, #6891:
URL: https://github.com/apache/hudi/pull/6891

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

**Risk level: none | low | medium | high**

_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] felixYyu commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`
felixYyu commented on code in PR #5064: URL: https://github.com/apache/hudi/pull/5064#discussion_r990579376 ## hudi-common/src/main/java/org/apache/hudi/common/table/catalog/FileBasedMetaClient.java: ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.catalog; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.PathFilter; +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.fs.ConsistencyGuardConfig; +import org.apache.hudi.common.fs.FSUtils; +import org.apache.hudi.common.fs.FailSafeConsistencyGuard; +import org.apache.hudi.common.fs.FileSystemRetryConfig; +import org.apache.hudi.common.fs.HoodieRetryWrapperFileSystem; +import org.apache.hudi.common.fs.HoodieWrapperFileSystem; +import org.apache.hudi.common.fs.NoOpConsistencyGuard; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.TimelineLayout; +import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion; +import org.apache.hudi.common.util.StringUtils; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.exception.TableNotFoundException; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +import java.io.IOException; +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +public class FileBasedMetaClient implements HoodieMetaClient, Serializable { + private static final long serialVersionUID = 1L; + private static final Logger LOG = LogManager.getLogger(FileBasedMetaClient.class); + public static final String METAFOLDER_NAME = ".hoodie"; + public static final String AUXILIARYFOLDER_NAME = METAFOLDER_NAME + Path.SEPARATOR + ".aux"; + public static final String SCHEMA_FOLDER_NAME = ".schema"; + + private SerializableConfiguration hadoopConf; + private ConsistencyGuardConfig consistencyGuardConfig; + private FileSystemRetryConfig fileSystemRetryConfig; + + public 
FileBasedMetaClient(SerializableConfiguration hadoopConf) { +this.hadoopConf = hadoopConf; +this.consistencyGuardConfig = ConsistencyGuardConfig.newBuilder().build(); +this.fileSystemRetryConfig = FileSystemRetryConfig.newBuilder().build(); + } + + public FileBasedMetaClient(SerializableConfiguration hadoopConf, ConsistencyGuardConfig consistencyGuardConfig, FileSystemRetryConfig fileSystemRetryConfig) { +this.hadoopConf = hadoopConf; +this.consistencyGuardConfig = consistencyGuardConfig; +this.fileSystemRetryConfig = fileSystemRetryConfig; + } + + public HoodieWrapperFileSystem getFs(String basePath) { +FileSystem fileSystem = FSUtils.getFs(new Path(basePath, METAFOLDER_NAME), hadoopConf.newCopy()); + +if (fileSystemRetryConfig.isFileSystemActionRetryEnable()) { + fileSystem = new HoodieRetryWrapperFileSystem(fileSystem, + fileSystemRetryConfig.getMaxRetryIntervalMs(), + fileSystemRetryConfig.getMaxRetryNumbers(), + fileSystemRetryConfig.getInitialRetryIntervalMs(), + fileSystemRetryConfig.getRetryExceptions()); +} +ValidationUtils.checkArgument(!(fileSystem instanceof HoodieWrapperFileSystem), +"File System not expected to be that of HoodieWrapperFileSystem"); +return new HoodieWrapperFileSystem(fileSystem, +consistencyGuardConfig.isConsistencyCheckEnabled() +? new FailSafeConsistencyGuard(fileSystem, consistencyGuardConfig) +: new NoOpConsistencyGuard()); + } + + public HoodieTableConfig getHoodieTableConfig(String basePath, String payloadClass) { +HoodieWrapperFileSystem fs = getFs(basePath); +Path metaPath = new Path(basePath, METAFOLDER_NAME); +TableNotFoundException.checkTableValidity(fs, new Path(basePath), metaPath); +return new HoodieTableConfig(fs, metaPath.toString(), payloadClass); + } + + public static HoodieTableConfig getHoodieTableConfig(String basePath,
[GitHub] [hudi] danny0405 commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
danny0405 commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1272208595

> @guanziyue Thanks for your positive feedback. IIUC, this improvement is effective for both Flink and Spark Streaming jobs when building the `FileSystemView`, and the time saved is considerable, as @ThinkerLei mentioned above. Of course, it involves some additional memory cost. I totally agree with the gatekeeper's concern, especially for the Flink engine, where the restart cost after an OOM is not acceptable. In our prod cluster, we did not observe any extra OOMs due to this change. Anyway, I think this is one option for performance improvement. FYI.

Thanks for the feedback. Can we have some numbers on the additional memory overhead here?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272202007 ## CI report: * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN * 3342da1ce44cd5405714218b671cbf4863d2c6ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12061) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6745: Fix comment in RFC46
alexeykudinkin commented on code in PR #6745: URL: https://github.com/apache/hudi/pull/6745#discussion_r987470813 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala: ## @@ -461,6 +461,18 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext, } protected def getTableState: HoodieTableState = { +val mergerImpls = (if (optParams.contains(HoodieWriteConfig.MERGER_IMPLS.key())) { Review Comment: @wzx140 let's abstract this behind common utility in the config ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -870,18 +869,17 @@ object HoodieSparkSqlWriter { hoodieRecord }).toJavaRDD() case HoodieRecord.HoodieRecordType.SPARK => +log.debug(s"Use ${HoodieRecord.HoodieRecordType.SPARK}") Review Comment: Let's lift this log before the match so that we can tell if it's Avro or Spark ## hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java: ## @@ -151,11 +155,10 @@ protected void processNextRecord(HoodieRecord hoodieRecord) throws IOExce HoodieRecord oldRecord = records.get(key); T oldValue = oldRecord.getData(); - T combinedValue = ((HoodieRecord) recordMerger.merge(oldRecord, hoodieRecord, readerSchema, this.getPayloadProps()).get()).getData(); Review Comment: Was `getPayloadProps` change intentional? 
Just calling it out to make sure we can validate ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -253,21 +253,21 @@ private Option prepareRecord(HoodieRecord hoodieRecord) { } private HoodieRecord populateMetadataFields(HoodieRecord hoodieRecord, Schema schema, Properties prop) throws IOException { -Map metadataValues = new HashMap<>(); -String seqId = -HoodieRecord.generateSequenceId(instantTime, getPartitionId(), RECORD_COUNTER.getAndIncrement()); +MetadataValues metadataValues = new MetadataValues(); if (config.populateMetaFields()) { - metadataValues.put(HoodieRecord.HoodieMetadataField.FILENAME_METADATA_FIELD.getFieldName(), fileId); - metadataValues.put(HoodieRecord.HoodieMetadataField.PARTITION_PATH_METADATA_FIELD.getFieldName(), partitionPath); - metadataValues.put(HoodieRecord.HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName(), hoodieRecord.getRecordKey()); - metadataValues.put(HoodieRecord.HoodieMetadataField.COMMIT_TIME_METADATA_FIELD.getFieldName(), instantTime); - metadataValues.put(HoodieRecord.HoodieMetadataField.COMMIT_SEQNO_METADATA_FIELD.getFieldName(), seqId); + String seqId = + HoodieRecord.generateSequenceId(instantTime, getPartitionId(), RECORD_COUNTER.getAndIncrement()); + metadataValues.setFileName(fileId); + metadataValues.setPartitionPath(partitionPath); + metadataValues.setRecordKey(hoodieRecord.getRecordKey()); + metadataValues.setCommitTime(instantTime); + metadataValues.setCommitSeqno(seqId); } if (config.allowOperationMetadataField()) { - metadataValues.put(HoodieRecord.HoodieMetadataField.OPERATION_METADATA_FIELD.getFieldName(), hoodieRecord.getOperation().getName()); + metadataValues.setOperation(hoodieRecord.getOperation().getName()); } -return hoodieRecord.updateValues(schema, prop, metadataValues); +return hoodieRecord.updateMetadataValues(schema, prop, metadataValues); Review Comment: Why do we need to update meta values if we're not populating them? 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -295,12 +295,11 @@ object HoodieSparkSqlWriter { tblName, mapAsJavaMap(addSchemaEvolutionParameters(parameters, internalSchemaOpt) - HoodieWriteConfig.AUTO_COMMIT_ENABLE.key) )).asInstanceOf[SparkRDDWriteClient[HoodieRecordPayload[Nothing]]] val writeConfig = client.getConfig -if (writeConfig.getRecordMerger.getRecordType == HoodieRecordType.SPARK && tableType == HoodieTableType.MERGE_ON_READ && Review Comment: I think `HoodieTableType.MERGE_ON_READ` would be preferred ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/LogFileIterator.scala: ## @@ -240,17 +236,22 @@ class RecordMergingFileIterator(split: HoodieMergeOnReadFileSplit, private def merge(curRow: InternalRow, newRecord: HoodieRecord[_]): Option[InternalRow] = { // NOTE: We have to pass in Avro Schema used to read from Delta Log file since we invoke combining API // on the record from the Delta Log +val curRecord = recordMerger.getRecordType match { + case HoodieRecordType.SPARK => +new HoodieSparkRecord(curRow, baseFileReader.schema) + case _ => +new HoodieAvroIndexedRecord(serial
[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables
hudi-bot commented on PR #5958: URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272185755 ## CI report: * 0869c63d96180152d4b9a51f70d2c9d83bb95edd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12060)
[GitHub] [hudi] nsivabalan opened a new pull request, #6890: [WIP] Batch clean delete files retry
nsivabalan opened a new pull request, #6890: URL: https://github.com/apache/hudi/pull/6890
[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling
hudi-bot commented on PR #6889: URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272169129 ## CI report: * 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12062)
[GitHub] [hudi] hudi-bot commented on pull request #6889: [HUDI-4921] Fixing last completed commit with clean scheduling
hudi-bot commented on PR #6889: URL: https://github.com/apache/hudi/pull/6889#issuecomment-1272167652 ## CI report: * 3ba1f6dedd50c01353fb77c2e50c2b0115bd2ea5 UNKNOWN
[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner
[ https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4921: - Labels: pull-request-available (was: ) > Fix last completed commit in CleanPlanner > - > > Key: HUDI-4921 > URL: https://issues.apache.org/jira/browse/HUDI-4921 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.13.0 > > > Recently we added the last completed commit as part of the clean commit metadata. > Ideally the value should represent the last completed commit in the timeline > before which there are no inflight commits, but we just get the last > completed commit in the active timeline and set the value. > This needs fixing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan opened a new pull request, #6889: [HUDI-4921] Fixing last completed commit with clean scheduling
nsivabalan opened a new pull request, #6889: URL: https://github.com/apache/hudi/pull/6889

### Change Logs

During clean planning, we set the last completed commit from the timeline. Currently we just fetch the latest completed commit, but the value has to refer to the last completed commit that has no inflight commits before it. This patch fixes that. It only affects multi-writer scenarios.

### Impact

**Risk level: low**
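The difference described in the change log can be sketched as follows. This is only an illustration under stated assumptions: `CleanTimelineSketch` and its `Instant` class are hypothetical stand-ins, not Hudi's timeline classes, the timeline is assumed to be sorted by timestamp, and the patch itself may compute the value differently.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; these are not Hudi's timeline classes.
final class CleanTimelineSketch {

    /** Minimal stand-in for a timeline instant: a timestamp plus a completed flag. */
    static final class Instant {
        final String timestamp;
        final boolean completed;

        Instant(String timestamp, boolean completed) {
            this.timestamp = timestamp;
            this.completed = completed;
        }
    }

    /** Naive variant: the latest completed instant, ignoring any earlier inflight ones. */
    static Optional<String> lastCompleted(List<Instant> sortedTimeline) {
        String last = null;
        for (Instant instant : sortedTimeline) {
            if (instant.completed) {
                last = instant.timestamp;
            }
        }
        return Optional.ofNullable(last);
    }

    /** Stricter variant: the latest completed instant with no inflight instant before it. */
    static Optional<String> lastCompletedBeforeFirstInflight(List<Instant> sortedTimeline) {
        String last = null;
        for (Instant instant : sortedTimeline) {
            if (!instant.completed) {
                break; // stop at the earliest inflight instant
            }
            last = instant.timestamp;
        }
        return Optional.ofNullable(last);
    }
}
```

With a timeline of `001` (completed), `002` (inflight) and `003` (completed), the naive variant picks `003` while the stricter one picks `001`. With a single writer and no inflight instants, the two variants return the same value, which matches the note that only multi-writer scenarios are affected.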
[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner
[ https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4921: -- Status: In Progress (was: Open) > Fix last completed commit in CleanPlanner > - > > Key: HUDI-4921 > URL: https://issues.apache.org/jira/browse/HUDI-4921 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.13.0 > > > Recently we added the last completed commit as part of the clean commit metadata. > Ideally the value should represent the last completed commit in the timeline > before which there are no inflight commits, but we just get the last > completed commit in the active timeline and set the value. > This needs fixing.
[jira] [Updated] (HUDI-4921) Fix last completed commit in CleanPlanner
[ https://issues.apache.org/jira/browse/HUDI-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4921: -- Status: Patch Available (was: In Progress) > Fix last completed commit in CleanPlanner > - > > Key: HUDI-4921 > URL: https://issues.apache.org/jira/browse/HUDI-4921 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.13.0 > > > Recently we added the last completed commit as part of the clean commit metadata. > Ideally the value should represent the last completed commit in the timeline > before which there are no inflight commits, but we just get the last > completed commit in the active timeline and set the value. > This needs fixing.
[jira] [Updated] (HUDI-3954) Don't keep the last commit before the earliest commit to retain
[ https://issues.apache.org/jira/browse/HUDI-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3954: -- Sprint: (was: 2022/10/04) > Don't keep the last commit before the earliest commit to retain > --- > > Key: HUDI-3954 > URL: https://issues.apache.org/jira/browse/HUDI-3954 > Project: Apache Hudi > Issue Type: Improvement > Components: cleaning >Reporter: 董可伦 >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.2 > > > Don't keep the last commit before the earliest commit to retain. > According to the documentation of `hoodie.cleaner.commits.retained`: > "Number of commits to retain, without cleaning. This will be retained for > num_of_commits * time_between_commits (scheduled). This also directly > translates into how much data retention the table supports for incremental > queries." > > We only need to keep the number of commits configured through the parameter > `hoodie.cleaner.commits.retained`, > and the commits retained by clean are completed. This ensures the "This will be > retained for num_of_commits * time_between_commits" statement in the documentation. > So we don't need to keep the last commit before the earliest commit to > retain. If we want to keep more versions, we can increase the parameter > `hoodie.cleaner.commits.retained`.
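The retention rule proposed in the issue can be sketched roughly as below. This is a simplified illustration, not Hudi's clean planner: `RetentionSketch` and `earliestCommitToRetain` are hypothetical names, the input is assumed to be the completed commit timestamps in ascending order, and `commitsRetained` stands in for the value of `hoodie.cleaner.commits.retained`.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not Hudi's CleanPlanner.
final class RetentionSketch {

    /**
     * Returns the oldest commit that must be kept when exactly commitsRetained
     * completed commits are retained, or empty when there are not more than
     * commitsRetained commits and therefore nothing to clean.
     */
    static Optional<String> earliestCommitToRetain(List<String> completedCommits, int commitsRetained) {
        if (commitsRetained <= 0) {
            throw new IllegalArgumentException("commitsRetained must be positive");
        }
        if (completedCommits.size() <= commitsRetained) {
            return Optional.empty();
        }
        // Keep the last commitsRetained commits; the first of those is the boundary.
        return Optional.of(completedCommits.get(completedCommits.size() - commitsRetained));
    }
}
```

For five completed commits `001` to `005` and a retained count of 3, the boundary in this sketch is `003`, so `001` and `002` are not kept for retention purposes. Keeping one extra commit before the boundary, which is what the issue argues against, would instead correspond to raising the retained count.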
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272144595 ## CI report: * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056) * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN * 3342da1ce44cd5405714218b671cbf4863d2c6ff Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12061)
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272142697 ## CI report: * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056) * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN * 3342da1ce44cd5405714218b671cbf4863d2c6ff UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables
hudi-bot commented on PR #5958: URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272142215 ## CI report: * 00eefd74074b2e0e04dc308ab9b775e09ed7803b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12035) * 0869c63d96180152d4b9a51f70d2c9d83bb95edd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12060)
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272140645 ## CI report: * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056) * 520496abf5f71acf20b6aa06b68cdc8dd84d344c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5958: [HUDI-3900] [UBER] Support log compaction action for MOR tables
hudi-bot commented on PR #5958: URL: https://github.com/apache/hudi/pull/5958#issuecomment-1272140065 ## CI report: * 00eefd74074b2e0e04dc308ab9b775e09ed7803b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12035) * 0869c63d96180152d4b9a51f70d2c9d83bb95edd UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272138058 ## CI report: * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12057)
[hudi] 01/01: [HUDI-4905] Improve type handling in proto schema conversion
This is an automated email from the ASF dual-hosted git repository. akudinkin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git commit 5d2c2853ea37ca8268f0d049460bee026216 Merge: 06d924137b 7d5b9dc0a9 Author: Alexey Kudinkin AuthorDate: Fri Oct 7 15:32:38 2022 -0700 [HUDI-4905] Improve type handling in proto schema conversion hudi-utilities/pom.xml | 1 - .../schema/ProtoClassBasedSchemaProvider.java | 19 +- .../sources/helpers/ProtoConversionUtil.java | 242 ++--- .../schema/TestProtoClassBasedSchemaProvider.java | 21 +- .../utilities/sources/TestProtoKafkaSource.java| 2 +- .../sources/helpers/TestProtoConversionUtil.java | 100 +++-- .../schema-provider/proto/oneof_schema.avsc| 42 .../resources/schema-provider/proto/sample.proto | 8 + ..._flattened.avsc => sample_schema_defaults.avsc} | 31 ++- ...le_schema_wrapped_and_timestamp_as_record.avsc} | 16 +- 10 files changed, 357 insertions(+), 125 deletions(-)
[hudi] branch master updated (06d924137b -> 5d2c2853ea)
This is an automated email from the ASF dual-hosted git repository. akudinkin pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 06d924137b [HUDI-2786] Docker demo on mac aarch64 (#6859) add 9c1fa14fd6 add support for unraveling proto schemas add 510d525e15 fix some compile issues add aad9ec1320 naming and style updates add 889927 make test data random, reuse code add a922a5beca add test for 2 different recursion depths, fix schema cache key add 3b37dc95d9 add unsigned long support add 706291d4f3 better handle other types add c28e874fca rebase on 4904 add 190cc16381 get all tests working add f18fff886e fix oneof expected schema, update tests after rebase add ff5baa8706 revert scala binary change add 0069da2d1a try a different method to avoid avro version add 71a39bf488 Merge remote-tracking branch 'origin/master' into HUDI-4905 add c5dff63375 delete unused file add f53d47ea3b address PR feedback, update decimal precision add 1831639e39 fix isNullable issue, check if class is Int64value add eca2992d65 checkstyle fix add 423da6f7bb change wrapper descriptor set initialization add fb2d9f0030 add in testing for unsigned long to BigInteger conversion add f03f9610cf shade protobuf dependency add 57f8b81194 Merge remote-tracking branch 'origin/master' into HUDI-4905 add 7d5b9dc0a9 Revert "shade protobuf dependency" new 5d2c2853ea [HUDI-4905] Improve type handling in proto schema conversion The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. 
Summary of changes: hudi-utilities/pom.xml | 1 - .../schema/ProtoClassBasedSchemaProvider.java | 19 +- .../sources/helpers/ProtoConversionUtil.java | 242 ++--- .../schema/TestProtoClassBasedSchemaProvider.java | 21 +- .../utilities/sources/TestProtoKafkaSource.java| 2 +- .../sources/helpers/TestProtoConversionUtil.java | 100 +++-- .../schema-provider/proto/oneof_schema.avsc| 42 .../resources/schema-provider/proto/sample.proto | 8 + ..._flattened.avsc => sample_schema_defaults.avsc} | 31 ++- ...le_schema_wrapped_and_timestamp_as_record.avsc} | 16 +- 10 files changed, 357 insertions(+), 125 deletions(-) create mode 100644 hudi-utilities/src/test/resources/schema-provider/proto/oneof_schema.avsc rename hudi-utilities/src/test/resources/schema-provider/proto/{sample_schema_flattened.avsc => sample_schema_defaults.avsc} (92%) rename hudi-utilities/src/test/resources/schema-provider/proto/{sample_schema_nested.avsc => sample_schema_wrapped_and_timestamp_as_record.avsc} (95%)
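The commit list above mentions unsigned long support and a test for unsigned long to BigInteger conversion, but the patch itself is not shown in this message. As a rough sketch of the general idea only: protobuf `uint64` values surface in Java as a signed `long`, so a value above `Long.MAX_VALUE` appears negative unless its bits are reinterpreted as unsigned. The class and method names below are hypothetical and are not taken from `ProtoConversionUtil`.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration; not the code from the commit.
final class UnsignedLongSketch {

    /** Interprets the 64 bits of value as an unsigned integer. */
    static BigInteger toUnsignedBigInteger(long value) {
        // Long.toUnsignedString renders the bit pattern as an unsigned decimal string.
        return new BigInteger(Long.toUnsignedString(value));
    }
}
```

For example, `toUnsignedBigInteger(-1L)` yields 18446744073709551615 (2^64 - 1), while non-negative inputs are returned unchanged as a `BigInteger`.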
[GitHub] [hudi] alexeykudinkin merged pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion
alexeykudinkin merged PR #6806: URL: https://github.com/apache/hudi/pull/6806
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1272095799 ## CI report: * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272095614 ## CI report: * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017) * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12057)
[GitHub] [hudi] nsivabalan commented on pull request #6887: [MINOR][DOCS] Fix docker_demo.md for 0.12.0
nsivabalan commented on PR #6887: URL: https://github.com/apache/hudi/pull/6887#issuecomment-1272085515 this is already addressed in https://github.com/apache/hudi/pull/6860
[jira] [Updated] (HUDI-4976) Update docker demo website page to explain how to run on m1 macs
[ https://issues.apache.org/jira/browse/HUDI-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4976: - Labels: pull-request-available (was: ) > Update docker demo website page to explain how to run on m1 macs > > > Key: HUDI-4976 > URL: https://issues.apache.org/jira/browse/HUDI-4976 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > Update the website to reflect the changes done in the HUDI-2786 fix
[GitHub] [hudi] nsivabalan merged pull request #6860: [HUDI-4976] added m1 changes to the site
nsivabalan merged PR #6860: URL: https://github.com/apache/hudi/pull/6860
[hudi] branch asf-site updated: [HUDI-4976] added m1 changes to the site (#6860)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new b662955f68 [HUDI-4976] added m1 changes to the site (#6860) b662955f68 is described below commit b662955f6885608961b0e8a83e1b841506087b88 Author: Jon Vexler AuthorDate: Fri Oct 7 17:03:53 2022 -0400 [HUDI-4976] added m1 changes to the site (#6860) --- website/docs/docker_demo.md| 63 +- .../versioned_docs/version-0.12.0/docker_demo.md | 76 +++--- 2 files changed, 128 insertions(+), 11 deletions(-) diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md index 698aec5439..681b1be51a 100644 --- a/website/docs/docker_demo.md +++ b/website/docs/docker_demo.md @@ -4,6 +4,8 @@ keywords: [ hudi, docker, demo] toc: true last_modified_at: 2019-12-30T15:59:57-04:00 --- +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; ## A Demo using Docker containers @@ -58,6 +60,15 @@ The next step is to run the Docker compose script and setup configs for bringing This should pull the Docker images from Docker hub and setup the Docker cluster. + + + ```java cd docker ./setup_demo.sh @@ -118,6 +129,50 @@ Copying spark default config and setting up configs $ docker ps ``` + + +Please note that Presto and Trino do not currently work for the docker demo on Mac AArch64 + +```java +cd docker +./setup_demo.sh --mac-aarch64 +... +.. 
+[+] Running 12/12 +⠿ adhoc-1 Pulled 2.9s +⠿ spark-worker-1 Pulled 3.0s +⠿ kafka Pulled2.9s +⠿ datanode1 Pulled2.9s +⠿ hivemetastore Pulled2.9s +⠿ hiveserver Pulled 3.0s +⠿ hive-metastore-postgresql Pulled2.8s +⠿ namenode Pulled 2.9s +⠿ sparkmaster Pulled 2.9s +⠿ zookeeper Pulled2.8s +⠿ adhoc-2 Pulled 2.9s +⠿ historyserver Pulled2.9s +[+] Running 12/12 +⠿ Container zookeeper Started 41.0s +⠿ Container kafkabrokerStarted 41.7s +⠿ Container hive-metastore-postgresql Running0.0s +⠿ Container namenode Running0.0s +⠿ Container hivemetastore Running0.0s +⠿ Container historyserver Started 41.0s +⠿ Container datanode1 Started 49.9s +⠿ Container hiveserver Running0.0s +⠿ Container sparkmasterStarted 41.9s +⠿ Container spark-worker-1 Started 50.2s +⠿ Container adhoc-2Started 38.5s +⠿ Container adhoc-1Started 38.5s +Copying spark default config and setting up configs +Copying spark default config and setting up configs +$ docker ps +``` + + + + At this point, the Docker cluster will be up and running. The demo cluster brings up the following services * HDFS Services (NameNode, DataNode) @@ -140,7 +195,9 @@ The batches are windowed intentionally so that the second batch contains updates ### Step 1 : Publish the first batch to Kafka -Upload the first batch to Kafka topic 'stock ticks' `cat docker/demo/data/batch_1.json | kcat -b kafkabroker -t stock_ticks -P` +Upload the first batch to Kafka topic 'stock ticks' + +`cat docker/demo/data/batch_1.json | kcat -b kafkabroker -t stock_ticks -P` To check if the new topic shows up, use ```java @@ -1137,7 +1194,7 @@ Compaction successfully completed for 20180924070031 # Now refresh and check again. 
You will see that there is a new compaction requested -hoodie:stock_ticks->refresh +hoodie:stock_ticks_mor->refresh 18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from /user/hive/warehouse/stock_ticks_mor 18/09/24 07:01:16 INFO table.HoodieTableConfig: Loading table properties from /user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties 18/09/24 07:01:16 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor @@ -1163,7 +1220,7 @@ hoodie:stock_ticks_mor->refresh 18/09/24 07:03:00 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1) from /user/hive/warehouse/stock_ticks_mor Metadata for table stock_ticks_mor loaded -hoodie:stock_ticks->compactions show all +hoodie:stock_ticks_mor->compactions show all 18/09/24 07:03:15 INFO timeline.HoodieActiveTimeline: Loaded instants
[jira] [Updated] (HUDI-2786) Failed to connect to namenode in Docker Demo on Apple M1 chip
[ https://issues.apache.org/jira/browse/HUDI-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-2786: - Labels: pull-request-available (was: ) > Failed to connect to namenode in Docker Demo on Apple M1 chip > - > > Key: HUDI-2786 > URL: https://issues.apache.org/jira/browse/HUDI-2786 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies, dev-experience >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Critical > Labels: pull-request-available > Fix For: 0.13.0 > > > {code:java} > > ./setup_demo.sh > [+] Running 1/0 > ⠿ compose Warning: No resource found to remove > > 0.0s > [+] Running 15/15 > ⠿ namenode Pulled > > 1.4s > ⠿ kafka Pulled > > 1.3s > ⠿ presto-worker-1 Pulled > > 1.3s > ⠿ historyserver Pulled > > 1.4s > ⠿ adhoc-2 Pulled > > 1.3s > ⠿ adhoc-1 Pulled > > 1.4s > ⠿ graphite Pulled > > 1.3s > ⠿ sparkmaster Pulled > > 1.3s > ⠿ hive-metastore-postgresql Pulled > > 1.3s > ⠿ presto-coordinator-1 Pulled > > 1.3s > ⠿ spark-worker-1 Pulled > > 1.4s > ⠿ hiveserver Pulled > > 1.3s > ⠿ hivemetastore Pulled > > 1.4s > ⠿ zookeeper Pulled > > 1.3s > ⠿ datanode1 Pulled > > 1.3s > [+] Running 16/16 > ⠿ Network compose_default Created > > 0.0s > ⠿ Container hive-metastore-postgresql Started > > 1.1s > ⠿ Container kafkabroker Started > > 1.1s > ⠿ Container zookeeper Started >
[hudi] branch master updated: [HUDI-2786] Docker demo on mac aarch64 (#6859)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 06d924137b [HUDI-2786] Docker demo on mac aarch64 (#6859) 06d924137b is described below commit 06d924137bbf216864ee4fa09018b325c8b0a636 Author: Jon Vexler AuthorDate: Fri Oct 7 17:02:09 2022 -0400 [HUDI-2786] Docker demo on mac aarch64 (#6859) --- ...pose_hadoop284_hive233_spark244_mac_aarch64.yml | 259 + docker/setup_demo.sh | 10 +- docker/stop_demo.sh| 7 +- 3 files changed, 272 insertions(+), 4 deletions(-) diff --git a/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml b/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml new file mode 100644 index 00..857180cfbe --- /dev/null +++ b/docker/compose/docker-compose_hadoop284_hive233_spark244_mac_aarch64.yml @@ -0,0 +1,259 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +version: "3.3" + +services: + + namenode: +image: apachehudi/hudi-hadoop_2.8.4-namenode:linux-arm64-0.10.1 +platform: linux/arm64 +hostname: namenode +container_name: namenode +environment: + - CLUSTER_NAME=hudi_hadoop284_hive232_spark244 +ports: + - "50070:50070" + - "8020:8020" + # JVM debugging port (will be mapped to a random port on host) + - "5005" +env_file: + - ./hadoop.env +healthcheck: + test: [ "CMD", "curl", "-f", "http://namenode:50070" ] + interval: 30s + timeout: 10s + retries: 3 + + datanode1: +image: apachehudi/hudi-hadoop_2.8.4-datanode:linux-arm64-0.10.1 +platform: linux/arm64 +container_name: datanode1 +hostname: datanode1 +environment: + - CLUSTER_NAME=hudi_hadoop284_hive232_spark244 +env_file: + - ./hadoop.env +ports: + - "50075:50075" + - "50010:50010" + # JVM debugging port (will be mapped to a random port on host) + - "5005" +links: + - "namenode" + - "historyserver" +healthcheck: + test: [ "CMD", "curl", "-f", "http://datanode1:50075" ] + interval: 30s + timeout: 10s + retries: 3 +depends_on: + - namenode + + historyserver: +image: apachehudi/hudi-hadoop_2.8.4-history:latest +hostname: historyserver +container_name: historyserver +environment: + - CLUSTER_NAME=hudi_hadoop284_hive232_spark244 +depends_on: + - "namenode" +links: + - "namenode" +ports: + - "58188:8188" +healthcheck: + test: [ "CMD", "curl", "-f", "http://historyserver:8188" ] + interval: 30s + timeout: 10s + retries: 3 +env_file: + - ./hadoop.env +volumes: + - historyserver:/hadoop/yarn/timeline + + hive-metastore-postgresql: +image: menorah84/hive-metastore-postgresql:2.3.0 +platform: linux/arm64 +environment: + - POSTGRES_HOST_AUTH_METHOD=trust +volumes: + - hive-metastore-postgresql:/var/lib/postgresql +hostname: hive-metastore-postgresql +container_name: hive-metastore-postgresql + + hivemetastore: +image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:linux-arm64-0.10.1 +platform: linux/arm64 +hostname: hivemetastore +container_name: hivemetastore +links: + -
"hive-metastore-postgresql" + - "namenode" +env_file: + - ./hadoop.env +command: /opt/hive/bin/hive --service metastore +environment: + SERVICE_PRECONDITION: "namenode:50070 hive-metastore-postgresql:5432" +ports: + - "9083:9083" + # JVM debugging port (will be mapped to a random port on host) + - "5005" +healthcheck: + test: [ "CMD", "nc", "-z", "hivemetastore", "9083" ] + interval: 30s + timeout: 10s + retries: 3 +depends_on: + - "hive-metastore-postgresql" + - "namenode" + + hiveserver: +image: apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:linux-arm64-0.10.1 +platform: linux/arm64 +hostname: hiveserver +container_name: hiveserver +env_file: + - ./hadoop.env +environment:
[GitHub] [hudi] nsivabalan merged pull request #6859: [HUDI-2786] Docker demo on mac aarch64
nsivabalan merged PR #6859: URL: https://github.com/apache/hudi/pull/6859
[hudi] branch master updated (a51181726c -> c5125d38b5)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from a51181726c [HUDI-4992] Fixing invalid min/max record key stats in Parquet metadata (#6883) add c5125d38b5 [HUDI-4972] Fixes to make unit tests work on m1 mac (#6751) No new revisions were added by this update. Summary of changes: hudi-examples/hudi-examples-java/pom.xml | 6 ++ hudi-timeline-service/pom.xml| 5 + pom.xml | 34 +++- 3 files changed, 44 insertions(+), 1 deletion(-)
[GitHub] [hudi] nsivabalan merged pull request #6751: [HUDI-4972] Fixes to make unit tests work on m1
nsivabalan merged PR #6751: URL: https://github.com/apache/hudi/pull/6751
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1272058386 ## CI report: * e246d65957362860b850f1af9ef973b85bf1a4eb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12017) * d7fbaa4fed0c713ee0b0a8ba4b8900b11b89c433 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion
hudi-bot commented on PR #6806: URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271997791 ## CI report: * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12054)
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271988433 ## CI report: * 1ad6147e60eb5247411ad3965106891139c47148 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055) * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12056)
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271977601 ## CI report: * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053) * 1ad6147e60eb5247411ad3965106891139c47148 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055) * e66a9361f23cdaea6826f3948eaaa14aa3d4bff0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271870496 ## CI report: * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053) * 1ad6147e60eb5247411ad3965106891139c47148 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12055)
[GitHub] [hudi] nsivabalan commented on pull request #6885: [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool
nsivabalan commented on PR #6885: URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271866369 @pramodbiligiri: there is a failure in GitHub Actions. Can you check on that, please?
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271865581 ## CI report: * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053) * 1ad6147e60eb5247411ad3965106891139c47148 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
hudi-bot commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271863767 ## CI report: * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12052)
[jira] [Updated] (HUDI-4972) Some unit tests cannot run on mac m1
[ https://issues.apache.org/jira/browse/HUDI-4972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4972: - Labels: pull-request-available (was: ) > Some unit tests cannot run on mac m1 > > > Key: HUDI-4972 > URL: https://issues.apache.org/jira/browse/HUDI-4972 > Project: Apache Hudi > Issue Type: Improvement > Components: dependencies >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > org.xerial.snappy is not compatible with M1 Macs before version 1.1.8.2. > Additionally, Spark is not compatible with M1 Macs before version 2.4.8, and > rocksdbjni is not compatible before version 6.29.4.1.
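The version floors quoted in HUDI-4972 can be checked mechanically. The following is a minimal sketch, not code from PR #6751: the class `MinVersionCheck` and its methods are hypothetical names introduced here for illustration, and the comparison assumes purely numeric dotted versions such as `1.1.8.2`.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Hudi): compares dotted numeric versions. */
public class MinVersionCheck {

    /** Returns a negative, zero, or positive value when a is older than, equal to, or newer than b. */
    public static int compareVersions(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // A missing trailing component is treated as 0, so "1.2" equals "1.2.0".
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** True when the actual version is at or above the required minimum. */
    public static boolean meetsMinimum(String actual, String minimum) {
        return compareVersions(actual, minimum) >= 0;
    }

    // Assumption: every component is a plain integer; qualifiers such as "-SNAPSHOT"
    // would make Integer.parseInt throw NumberFormatException.
    private static int[] parse(String version) {
        return Arrays.stream(version.split("\\.")).mapToInt(Integer::parseInt).toArray();
    }

    public static void main(String[] args) {
        // Minimum versions are the ones stated in the HUDI-4972 description;
        // the "actual" versions are made-up inputs for the demonstration.
        System.out.println("snappy-java 1.1.7.3 usable on M1: " + meetsMinimum("1.1.7.3", "1.1.8.2"));
        System.out.println("Spark 2.4.8 usable on M1: " + meetsMinimum("2.4.8", "2.4.8"));
        System.out.println("rocksdbjni 6.29.4.1 usable on M1: " + meetsMinimum("6.29.4.1", "6.29.4.1"));
    }
}
```

A check like this could run in a build script or a test precondition on aarch64 hosts; whether the Hudi build does anything similar is not stated in the thread.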
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6751: [HUDI-4972] Fixes to make unit tests work on m1
nsivabalan commented on code in PR #6751: URL: https://github.com/apache/hudi/pull/6751#discussion_r990326965 ## packaging/hudi-utilities-bundle/pom.xml: ## @@ -376,6 +376,12 @@ org.apache.parquet parquet-avro compile + + + org.xerial.snappy + snappy-java Review Comment: we test this and things are good.
[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion
hudi-bot commented on PR #6806: URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271803780 ## CI report: * 57f8b811946c7c013cbc31e9bd8db469f70cda2a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12036) * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12054)
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271799001 ## CI report: * ed246a4e7ff10fe5c2ba9720823873441e1b5831 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12053)
[GitHub] [hudi] hudi-bot commented on pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion
hudi-bot commented on PR #6806: URL: https://github.com/apache/hudi/pull/6806#issuecomment-1271798724 ## CI report: * 57f8b811946c7c013cbc31e9bd8db469f70cda2a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12036) * 7d5b9dc0a9334b96adcdb8e17964f31944e94e91 UNKNOWN
[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects
[ https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4995: Priority: Minor (was: Major) > Depency conflicts on apache http with other projects > > > Key: HUDI-4995 > URL: https://issues.apache.org/jira/browse/HUDI-4995 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Minor > Fix For: 0.12.1 > > > Hudi imports org.apache.http, which can collide with other libs such as the > Elasticsearch client. The spark-bundle then creates conflicts when both libs > are used.
[jira] [Updated] (HUDI-4995) Depency conflicts on apache http with other projects
[ https://issues.apache.org/jira/browse/HUDI-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nicolas paris updated HUDI-4995: Fix Version/s: 0.12.1 > Depency conflicts on apache http with other projects > > > Key: HUDI-4995 > URL: https://issues.apache.org/jira/browse/HUDI-4995 > Project: Apache Hudi > Issue Type: Improvement >Reporter: nicolas paris >Priority: Major > Fix For: 0.12.1 > > > Hudi imports org.apache.http, which can collide with other libs such as the > Elasticsearch client. The spark-bundle then creates conflicts when both libs > are used.
[GitHub] [hudi] hudi-bot commented on pull request #6888: [HUDI-4982] [DO NOT MERGE] [DO NOT REVIEW] add spark bundle tests to github actions
hudi-bot commented on PR #6888: URL: https://github.com/apache/hudi/pull/6888#issuecomment-1271793283 ## CI report: * ed246a4e7ff10fe5c2ba9720823873441e1b5831 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271791942 ## CI report: * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12051)
[jira] [Created] (HUDI-4995) Depency conflicts on apache http with other projects
nicolas paris created HUDI-4995: --- Summary: Depency conflicts on apache http with other projects Key: HUDI-4995 URL: https://issues.apache.org/jira/browse/HUDI-4995 Project: Apache Hudi Issue Type: Improvement Reporter: nicolas paris Hudi imports org.apache.http, which can collide with other libs such as the Elasticsearch client. The spark-bundle then creates conflicts when both libs are used.
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6806: [HUDI-4905] Improve type handling in proto schema conversion
alexeykudinkin commented on code in PR #6806: URL: https://github.com/apache/hudi/pull/6806#discussion_r990227333 ## hudi-utilities/pom.xml: ## @@ -85,7 +85,6 @@ com.google.protobuf protobuf-java-util - test Review Comment: Yes, let's keep changes bounded -- shading is the right change but there's no reason to take it in this PR, since we're adding Proto deps to bundles
[jira] [Closed] (HUDI-4992) Spark Row-writing Bulk Insert produces incorrect Bloom Filter metadata
[ https://issues.apache.org/jira/browse/HUDI-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-4992. Resolution: Fixed > Spark Row-writing Bulk Insert produces incorrect Bloom Filter metadata > -- > > Key: HUDI-4992 > URL: https://issues.apache.org/jira/browse/HUDI-4992 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > Troubleshooting duplicates issue w/ Abhishek Modi from Notion, we've found > that the min/max record key stats are being currently persisted incorrectly > into Parquet metadata, leading to duplicate records being produced in their > pipeline after initial bulk-insert.
[jira] [Updated] (HUDI-4472) Revisit schema handling in HoodieSparkSqlWriter
[ https://issues.apache.org/jira/browse/HUDI-4472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4472: - Reviewers: Ethan Guo, Raymond Xu, sivabalan narayanan > Revisit schema handling in HoodieSparkSqlWriter > --- > > Key: HUDI-4472 > URL: https://issues.apache.org/jira/browse/HUDI-4472 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Affects Versions: 0.12.0 >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.13.0 > > > After many features aimed at bringing more and more sophisticated support for > schema evolution were layered in w/in HoodieSparkSqlWriter, it currently > requires careful attention to reconcile many flows and make sure that the > original invariants still hold. > > One example of the issue was discovered while addressing HUDI-4081 (which was > duct-taped in [#6213|https://github.com/apache/hudi/pull/6213/files#] to > avoid substantial changes before the release)
[GitHub] [hudi] xushiyan commented on issue #6692: [SUPPORT] ClassCastException after migration to Hudi 0.12.0
xushiyan commented on issue #6692: URL: https://github.com/apache/hudi/issues/6692#issuecomment-1271738970 I think the issue originates from the `org.apache.avro.*` package shaded in hadoop-mr-bundle. @eshu, can you list all the jars you add to the classpath, specifically the Hudi jars?
[jira] [Updated] (HUDI-4982) Make bundle combination testing covered in CI
[ https://issues.apache.org/jira/browse/HUDI-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4982: - Labels: pull-request-available (was: ) > Make bundle combination testing covered in CI > - > > Key: HUDI-4982 > URL: https://issues.apache.org/jira/browse/HUDI-4982 > Project: Apache Hudi > Issue Type: Test >Reporter: Raymond Xu >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.12.2
[GitHub] [hudi] jonvex opened a new pull request, #6888: [HUDI-4982] [DO NOT MERGE] add spark bundle tests to github actions
jonvex opened a new pull request, #6888: URL: https://github.com/apache/hudi/pull/6888
[jira] [Commented] (HUDI-4975) datahub sync bundle causes class loading issue
[ https://issues.apache.org/jira/browse/HUDI-4975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614121#comment-17614121 ] Raymond Xu commented on HUDI-4975: -- Root-caused the issue: when datahub-sync-bundle is built with the spark3.3 profile, it expects {code} https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java {code} which does not exist in parquet 1.10.1, the version used by spark 3.1. This means datahub-sync-bundle (and possibly other sync bundles) is not fully decoupled from spark profiles. This can be mitigated by putting spark-bundle first in the classpath, but we should eliminate the root issue. > datahub sync bundle causes class loading issue > -- > > Key: HUDI-4975 > URL: https://issues.apache.org/jira/browse/HUDI-4975 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Critical > Fix For: 0.12.2 > > > run utilities-slim.jar as the main jar for deltastreamer > set --jars > /tmp/hudi-datahub-sync-bundle-0.12.1-rc1.jar,/tmp/hudi-spark3.1-bundle_2.12-0.12.1-rc1.jar > put datahub sync bundle before spark bundle resulted in class loader issue. 
> works fine if spark bundle goes first > {code:bash} > Caused by: java.lang.NoClassDefFoundError: > org/apache/parquet/schema/LogicalTypeAnnotation > at > org.apache.hudi.io.storage.HoodieFileWriterFactory.newParquetFileWriter(HoodieFileWriterFactory.java:78) > at > org.apache.hudi.io.storage.HoodieFileWriterFactory.newParquetFileWriter(HoodieFileWriterFactory.java:70) > at > org.apache.hudi.io.storage.HoodieFileWriterFactory.getFileWriter(HoodieFileWriterFactory.java:54) > at > org.apache.hudi.io.HoodieCreateHandle.<init>(HoodieCreateHandle.java:104) > at > org.apache.hudi.io.HoodieCreateHandle.<init>(HoodieCreateHandle.java:76) > at > org.apache.hudi.io.CreateHandleFactory.create(CreateHandleFactory.java:46) > at > org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:83) > at > org.apache.hudi.execution.CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteInsertHandler.java:40) > at > org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37) > at > org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:135) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.lang.ClassNotFoundException: > org.apache.parquet.schema.LogicalTypeAnnotation > at java.net.URLClassLoader.findClass(URLClassLoader.java:387) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > ... 14 more > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
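The ordering effect reported in HUDI-4975 can be illustrated with a small model of first-match class resolution. This is only a sketch under stated assumptions: the `JARS` table, the shortened class names, and the helpers `first_provider` and `unresolved` are hypothetical, and the jar contents are guessed from the comment above (the datahub bundle's `HoodieFileWriterFactory` needs `LogicalTypeAnnotation`, the spark3.1 bundle's copy does not, and the Parquet 1.10.1 runtime does not ship it). It is not how the JVM or Spark is actually implemented.

```python
from typing import Dict, List, Optional, Set

# Hypothetical jar contents: class name -> classes it needs at run time.
JARS: Dict[str, Dict[str, Set[str]]] = {
    "hudi-datahub-sync-bundle": {
        # Assumed to be compiled against Parquet 1.12.x (spark3.3 profile).
        "HoodieFileWriterFactory": {"LogicalTypeAnnotation"},
    },
    "hudi-spark3.1-bundle": {
        # Assumed to be compiled against Parquet 1.10.1.
        "HoodieFileWriterFactory": set(),
    },
    "spark-3.1-runtime": {
        # Parquet 1.10.1 classes only, so no LogicalTypeAnnotation here.
        "MessageType": set(),
    },
}


def first_provider(class_name: str, classpath: List[str]) -> Optional[str]:
    """Return the first jar on the classpath that ships class_name, if any."""
    for jar in classpath:
        if class_name in JARS[jar]:
            return jar
    return None


def unresolved(class_name: str, classpath: List[str]) -> Set[str]:
    """Classes that would fail to load when class_name is first used."""
    provider = first_provider(class_name, classpath)
    if provider is None:
        return {class_name}
    needed = JARS[provider][class_name]
    return {dep for dep in needed if first_provider(dep, classpath) is None}


datahub_first = ["hudi-datahub-sync-bundle", "hudi-spark3.1-bundle", "spark-3.1-runtime"]
spark_first = ["hudi-spark3.1-bundle", "hudi-datahub-sync-bundle", "spark-3.1-runtime"]

print(unresolved("HoodieFileWriterFactory", datahub_first))  # {'LogicalTypeAnnotation'}
print(unresolved("HoodieFileWriterFactory", spark_first))    # set()
```

In this model, reordering the jars only hides the mismatch, which matches the comment's point that the sync bundle should be decoupled from the spark profile rather than relying on classpath order.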
[GitHub] [hudi] hudi-bot commented on pull request #6886: [HUDI-4994] Fix bug that prevents re-ingestion of soft-deleted Datahub entities
hudi-bot commented on PR #6886: URL: https://github.com/apache/hudi/pull/6886#issuecomment-1271727870 ## CI report: * 832c98a95c9482d24389ee2f2052893097bfdeda Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12049) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
zhangyue19921010 commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271721244 Hi @alexeykudinkin and @nsivabalan, I really appreciate your review and comments! Sorry for taking a few days to address them. Please take a look at your convenience :) If there are any omissions or other things that need to be modified, please feel free to let me know! Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6885: [HUDI-4993] Make DataPlatform name and Dataset env configurable in DatahubSyncTool
hudi-bot commented on PR #6885: URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271715586 ## CI report: * 0684a683ea55e513e8c896e5748270a9e5953d44 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12048) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
hudi-bot commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271713313 ## CI report: * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12052) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhangyue19921010 commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
zhangyue19921010 commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271688530 @hudi-bot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271623904 ## CI report: * 730f5b91c206267dc89e471732f56679424c7391 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050) * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12051) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271616397 ## CI report: * e391adad3fa33145e8814160107d2afbcb450597 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045) * 730f5b91c206267dc89e471732f56679424c7391 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050) * d18a40d00cb6ff6c2ff2768b289c1435e3ceaa28 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
hudi-bot commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271613867 ## CI report: * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4994) DatahubSyncTool does not correctly re-ingest soft-deleted entities
[ https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pramod Biligiri updated HUDI-4994: -- Description: Datahub has a notion of soft-deletes (the entity still exists in the database with a status=removed:true). Such entities could get re-ingested with new properties at a later time, such that the older one gets overwritten. The current implementation in DatahubSyncTool does not handle this scenario. It fails to update the status flag to removed:false during ingest, which means the entity won't surface in the Datahub UI at all. Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default] was: When DatahubSyncTool updates an entity in Datahub using an UPSERT request of their RestEmitter client, it can be assumed that the entity is no longer considered deleted, and needs to be discoverable henceforth in the Datahub UI. For that, it is necessary to explicitly set the "status" metadata aspect of the entity to "{'removed':false}". This will handle the situation where the entity may have been (soft) deleted in the past. The addition of this "removed:false" for "status" aspect has no impact on newly created entities, or hard-deleted entities (of which no trace remains anyway). 
Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default Summary: DatahubSyncTool does not correctly re-ingest soft-deleted entities (was: DatahubSyncTool should set "removed" status of an entity to false when updating it) > DatahubSyncTool does not correctly re-ingest soft-deleted entities > -- > > Key: HUDI-4994 > URL: https://issues.apache.org/jira/browse/HUDI-4994 > Project: Apache Hudi > Issue Type: Task > Components: meta-sync >Reporter: Pramod Biligiri >Priority: Major > Labels: pull-request-available > > Datahub has a notion of soft-deletes (the entity still exists in the database > with a status=removed:true). Such entities could get re-ingested with new > properties at a later time, such that the older one gets overwritten. The > current implementation in DatahubSyncTool does not handle this scenario. It > fails to update the status flag to removed:false during ingest, which means > the entity won't surface in the Datahub UI at all. > Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: > [https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default] -- This message was sent by Atlassian Jira (v8.20.10#820010)
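The soft-delete behaviour described in HUDI-4994 can be sketched with an in-memory stand-in for the metadata store. Everything here is an assumption made for illustration: the dict-based `store`, the `upsert` and `visible` helpers, and the example URN are hypothetical and do not use or mirror the Datahub client API. The sketch only shows why an upsert that leaves the status aspect alone keeps a soft-deleted entity hidden, and why writing `removed: False` on every upsert is harmless for new entities.

```python
from typing import Dict


def upsert(store: Dict[str, dict], urn: str, properties: dict, reset_status: bool) -> None:
    """Write properties for urn; optionally also write status removed=False."""
    # A brand-new entity starts out not removed in this model.
    entity = store.setdefault(urn, {"properties": {}, "status": {"removed": False}})
    entity["properties"] = dict(properties)
    if reset_status:
        # The behaviour the ticket asks for: undo any earlier soft delete.
        entity["status"] = {"removed": False}


def visible(store: Dict[str, dict], urn: str) -> bool:
    """An entity is shown only if it exists and is not marked removed."""
    entity = store.get(urn)
    return entity is not None and not entity["status"]["removed"]


urn = "urn:example:dataset:trips"  # hypothetical identifier
store = {urn: {"properties": {"owner": "old"}, "status": {"removed": True}}}

upsert(store, urn, {"owner": "new"}, reset_status=False)
print(visible(store, urn))  # False: properties changed, entity still hidden

upsert(store, urn, {"owner": "new"}, reset_status=True)
print(visible(store, urn))  # True: the soft delete was undone
```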
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271542716 ## CI report: * e391adad3fa33145e8814160107d2afbcb450597 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045) * 730f5b91c206267dc89e471732f56679424c7391 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12050) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271536505 ## CI report: * e391adad3fa33145e8814160107d2afbcb450597 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045) * 730f5b91c206267dc89e471732f56679424c7391 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1271530268 ## CI report: * e391adad3fa33145e8814160107d2afbcb450597 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12045) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan opened a new pull request, #6887: [MINOR][DOCS] Fix docker_demo.md for 0.12.0
xushiyan opened a new pull request, #6887: URL: https://github.com/apache/hudi/pull/6887 ### Change Logs Update 0.12.0 docs. ### Impact **Risk level: none** ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on issue #6875: [SUPPORT] - [Docker Demo] - Partition key parts [dt] does not match with partition values [2018, 08, 31]
xushiyan commented on issue #6875: URL: https://github.com/apache/hudi/issues/6875#issuecomment-1271502980 @pavimotorq I think the website was not updated somehow. In the latest docs https://hudi.apache.org/docs/next/docker_demo we have updated the commands to include ` --partition-value-extractor org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`. Please follow that guide. cc @bhasudha we may need to backfill this change for the 0.12.0 version of the docs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
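The mismatch in the title of issue #6875 (one partition key `dt`, three partition values `2018, 08, 31`) can be reproduced with a rough sketch. Both functions are illustrative stand-ins written for this note, not Hudi's code: `multi_part_values` models an extractor that yields one value per path segment, and `slash_encoded_day_value` models what the name `SlashEncodedDayPartitionValueExtractor` suggests, assuming it folds a `yyyy/MM/dd` path into a single `yyyy-MM-dd` value.

```python
from typing import List


def multi_part_values(partition_path: str) -> List[str]:
    """One partition value per path segment (stand-in for a multi-part extractor)."""
    return partition_path.split("/")


def slash_encoded_day_value(partition_path: str) -> List[str]:
    """Fold a yyyy/MM/dd path into a single day value (assumed behaviour)."""
    parts = partition_path.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected yyyy/MM/dd, got {partition_path!r}")
    year, month, day = parts
    return [f"{year}-{month}-{day}"]


partition_keys = ["dt"]
path = "2018/08/31"

print(multi_part_values(path))        # ['2018', '08', '31']: three values for one key
print(slash_encoded_day_value(path))  # ['2018-08-31']: one value for one key
print(len(slash_encoded_day_value(path)) == len(partition_keys))  # True
```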
[GitHub] [hudi] xushiyan commented on issue #6881: Processing time is increased with hudi metadata enable
xushiyan commented on issue #6881: URL: https://github.com/apache/hudi/issues/6881#issuecomment-1271484448 @koochiswathiTR Does every commit take 15 min after enabling the metadata table? Please also provide the Hudi and Spark versions, workload info (how many records are in each commit and how many of them are updates), and environment info. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6886: [HUDI-4994] Adding undo of soft-delete to upsert code flow
hudi-bot commented on PR #6886: URL: https://github.com/apache/hudi/pull/6886#issuecomment-1271468381 ## CI report: * 832c98a95c9482d24389ee2f2052893097bfdeda Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12049) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6885: [HUDI-4993] Making DataPlatform name and Dataset env configurable
hudi-bot commented on PR #6885: URL: https://github.com/apache/hudi/pull/6885#issuecomment-1271468346 ## CI report: * 0684a683ea55e513e8c896e5748270a9e5953d44 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12048) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
hudi-bot commented on PR #5416: URL: https://github.com/apache/hudi/pull/5416#issuecomment-1271466614 ## CI report: * b838e1f406902c9bdfb5e84d53ef5a5effd0765b UNKNOWN * 6114ee2aa59f087e5ef0b1b53979eec143b33f5e UNKNOWN * 92760dbf5a047fe1f9941fa4b36c944eb3bec5c7 UNKNOWN * 4587303118918c5e56ecb10732d9fcba43a90ee7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12044) * 4ba91d4ce8345b4917e1f402694a55d07bf2951c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12047) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated: [MINOR][DOCS] remove validation code from quickstart examples (#6879)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new e013d654da [MINOR][DOCS] remove validation code from quickstart examples (#6879) e013d654da is described below commit e013d654da67fedca5a24026ab93e8fc34f5b3a8 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Fri Oct 7 19:22:42 2022 +0800 [MINOR][DOCS] remove validation code from quickstart examples (#6879) --- website/docs/quick-start-guide.md | 40 --- 1 file changed, 40 deletions(-) diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md index 4beb9577c5..ed7bb29698 100644 --- a/website/docs/quick-start-guide.md +++ b/website/docs/quick-start-guide.md @@ -189,7 +189,6 @@ import org.apache.hudi.common.model.HoodieRecord val tableName = "hudi_trips_cow" val basePath = "file:///tmp/hudi_trips_cow" val dataGen = new DataGenerator -val snapshotQuery = "SELECT begin_lat, begin_lon, driver, end_lat, end_lon, fare, partitionpath, rider, ts, uuid FROM hudi_ro_table" ``` @@ -200,7 +199,6 @@ val snapshotQuery = "SELECT begin_lat, begin_lon, driver, end_lat, end_lon, fare tableName = "hudi_trips_cow" basePath = "file:///tmp/hudi_trips_cow" dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator() -snapshotQuery = "SELECT begin_lat, begin_lon, driver, end_lat, end_lon, fare, partitionpath, rider, ts, uuid FROM hudi_ro_table" ``` @@ -429,9 +427,6 @@ df.write.format("hudi"). option(TABLE_NAME, tableName). mode(Overwrite). save(basePath) - -// validations -assert(df.except(spark.sql(snapshotQuery)).count() == 0) ``` :::info `mode(Overwrite)` overwrites and recreates the table if it already exists. @@ -468,9 +463,6 @@ df.write.format("hudi"). \ options(**hudi_options). \ mode("overwrite"). 
\ save(basePath) - -# validations -assert spark.sql(snapshotQuery).exceptAll(df).count() == 0 ``` :::info `mode(Overwrite)` overwrites and recreates the table if it already exists. @@ -713,7 +705,6 @@ values={[ ```scala // spark-shell -val snapBeforeUpdate = spark.sql(snapshotQuery) val updates = convertToStringList(dataGen.generateUpdates(10)) val df = spark.read.json(spark.sparkContext.parallelize(updates, 2)) df.write.format("hudi"). @@ -724,10 +715,6 @@ df.write.format("hudi"). option(TABLE_NAME, tableName). mode(Append). save(basePath) - -// validations -assert(spark.sql(snapshotQuery).intersect(df).count() == df.count()) -assert(spark.sql(snapshotQuery).except(df).except(snapBeforeUpdate).count() == 0) ``` :::note Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time. @@ -816,17 +803,12 @@ when not matched then ```python # pyspark -snapshotBeforeUpdate = spark.sql(snapshotQuery) updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10)) df = spark.read.json(spark.sparkContext.parallelize(updates, 2)) df.write.format("hudi"). \ options(**hudi_options). \ mode("append"). \ save(basePath) - -# validations -assert spark.sql(snapshotQuery).intersect(df).count() == df.count() -assert spark.sql(snapshotQuery).exceptAll(snapshotBeforeUpdate).exceptAll(df).count() == 0 ``` :::note Notice that the save mode is now `Append`. In general, always use append mode unless you are trying to create the table for the first time. @@ -1122,7 +1104,6 @@ Delete records for the HoodieKeys passed in. ```scala // spark-shell -val snapshotBeforeDelete = spark.sql(snapshotQuery) // fetch total records count spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count() // fetch two records to be deleted @@ -1151,10 +1132,6 @@ val roAfterDeleteViewDF = spark. 
roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot") // fetch should return (total - 2) records spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count() - -// validations -assert(spark.sql("select uuid, partitionpath, ts from hudi_trips_snapshot").intersect(hardDeleteDf).count() == 0) -assert(snapshotBeforeDelete.except(spark.sql("select uuid, partitionpath, ts from hudi_trips_snapshot")).except(snapshotBeforeDelete).count() == 0) ``` :::note Only `Append` mode is supported for delete operation. @@ -1182,7 +1159,6 @@ Delete records for the HoodieKeys passed in. ```python # pyspark -snapshotBeforeDelete = spark.sql(snapshotQuery) # fetch total records count spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count() # fetch two records to be deleted @@ -1216,10 +1192,6 @@ roAfterDeleteViewDF = spark. \ roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot") # fetch should return (total - 2) records spark.sql("select uuid, partitionpath from hudi_