[PR] [MINOR] add logger to CompactionPlanOperator & ClusteringPlanOperator [hudi]
eric9204 opened a new pull request, #10562: URL: https://github.com/apache/hudi/pull/10562 ### Change Logs None ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update None - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7218] Integrate new HFile reader with file reader factory [hudi]
nsivabalan commented on code in PR #10330: URL: https://github.com/apache/hudi/pull/10330#discussion_r1465901347 ## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java: ## @@ -29,6 +29,13 @@ groupName = ConfigGroups.Names.READER, description = "Configurations that control file group reading.") public class HoodieReaderConfig extends HoodieConfig { + public static final ConfigProperty USE_BUILT_IN_HFILE_READER = ConfigProperty + .key("hoodie.hfile.use.built.in.reader") Review Comment: is this meant to be deprecated in near future and it not really expected to be used by end user? then should we consider prefixing with "_" ## hudi-common/src/main/java/org/apache/hudi/common/bloom/HoodieDynamicBoundedBloomFilter.java: ## @@ -64,14 +66,17 @@ public class HoodieDynamicBoundedBloomFilter implements BloomFilter { public HoodieDynamicBoundedBloomFilter(String serString) { // ignoring the type code for now, since we have just one version byte[] bytes = Base64CodecUtil.decode(serString); -DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes)); -try { - internalDynamicBloomFilter = new InternalDynamicBloomFilter(); - internalDynamicBloomFilter.readFields(dis); - dis.close(); -} catch (IOException e) { - throw new HoodieIndexException("Could not deserialize BloomFilter instance", e); -} +extractAndSetInternalBloomFilter(new DataInputStream(new ByteArrayInputStream(bytes))); Review Comment: is it possible to do try with resource design here ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -211,9 +233,10 @@ protected ClosableIterator> lookupRecords(List sorte blockContentLoc.getContentPositionInLogFile(), blockContentLoc.getBlockSize()); -try (final HoodieAvroHFileReader reader = - new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf), - Option.of(getSchemaFromHeader( { +try (final BaseHoodieAvroHFileReader reader = 
(BaseHoodieAvroHFileReader) Review Comment: can we try to think of better naming for HoodieAvroFileReaderBase and BaseHoodieAvroHFileReader? do you think we can rename BaseHoodieAvroHFileReader to HoodieAvroHFileReaderImplBase or HoodieAvroHFileReaderBaseImpl ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java: ## @@ -728,42 +464,100 @@ public IndexedRecord next() { @Override public void close() { try { -scanner.close(); reader.close(); } catch (IOException e) { -throw new HoodieIOException("Error closing the hfile reader and scanner", e); +throw new HoodieIOException("Error closing the HFile reader and scanner", e); } } - } - static class SeekableByteArrayInputStream extends ByteBufferBackedInputStream implements Seekable, PositionedReadable { -public SeekableByteArrayInputStream(byte[] buf) { - super(buf); -} +private static Iterator getRecordByKeyPrefixIteratorInternal(HFileReader reader, Review Comment: I see a lot of code duplication across HoodieAvroHBaseHFileReader and HoodieAvroHFileReader, e.g. RecordByKeyPrefixIterator, RecordByKeyIterator.
Can we try to fix them and reuse code? ## hudi-common/src/main/java/org/apache/hudi/common/bloom/SimpleBloomFilter.java: ## @@ -138,4 +144,12 @@ public BloomFilterTypeCode getBloomFilterTypeCode() { return BloomFilterTypeCode.SIMPLE; } + private void extractAndSetInternalBloomFilter(DataInputStream dis) { +try { + this.filter.readFields(dis); + dis.close(); Review Comment: same comment as above ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -182,7 +183,7 @@ public static List> filterKeysFromFile(Path filePath, List> foundRecordKeys = new ArrayList<>(); try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO) -.getFileReader(configuration, filePath)) { +.getFileReader(new HoodieConfig(), configuration, filePath)) { Review Comment: if it's always an empty one, should we declare a singleton instance and reuse it wherever required? you can name it DEFAULT_HUDI_CONFIG_FOR_READER ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -83,19 +89,29 @@ public HoodieHFileDataBlock(FSDataInputStream inputStream, Map header, Map footer, boolean enablePointLookups, - Path pathForReader) { -super(content, inputStream, readBlockLazily, Option.of(logBlockContentLocation), readerSc
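The try-with-resources suggestion made twice in the review above (for the bloom filter constructors that call `dis.close()` by hand) could look roughly like the sketch below. It is not the PR's code: `TryWithResourcesSketch` and `readFirstInt` are hypothetical names, the payload is a plain Base64-encoded int rather than a serialized bloom filter, and `UncheckedIOException` stands in for Hudi's own exception type.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;

// Hypothetical stand-in for a class that deserializes state from a Base64 string.
class TryWithResourcesSketch {

  // Decodes the string and reads one int from it. The stream is declared in the
  // try header, so it is closed automatically on both the normal and the
  // exceptional path; no explicit dis.close() call is needed.
  static int readFirstInt(String base64) {
    byte[] bytes = Base64.getDecoder().decode(base64);
    try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes))) {
      return dis.readInt();
    } catch (IOException e) {
      // Assumption: a real implementation would throw its own domain exception here.
      throw new UncheckedIOException("Could not deserialize payload", e);
    }
  }

  public static void main(String[] args) {
    String encoded = Base64.getEncoder().encodeToString(new byte[] {0, 0, 0, 42});
    System.out.println(readFirstInt(encoded));
  }
}
```

The point of the pattern is that the close call can no longer be skipped when `readInt` (or, in the reviewed code, `readFields`) throws.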
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
hudi-bot commented on PR #10554: URL: https://github.com/apache/hudi/pull/10554#issuecomment-1909446353 ## CI report: * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144) * 58eaae52e37c5be3354346cab9a4f22769ff8129 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22158) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
hudi-bot commented on PR #10554: URL: https://github.com/apache/hudi/pull/10554#issuecomment-1909439147 ## CI report: * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144) * 58eaae52e37c5be3354346cab9a4f22769ff8129 UNKNOWN
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909432153 ## CI report: * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22155)
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
xuzifu666 commented on code in PR #10554: URL: https://github.com/apache/hudi/pull/10554#discussion_r1465887278 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala: ## @@ -2160,6 +2174,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase { |union |select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 12 as `hour` |""".stripMargin) + val stageClassName = classOf[HoodieSparkEngineContext].getSimpleName + spark.sparkContext.addSparkListener(new StageParallelismListener(stageName = stageClassName)) Review Comment: @bvaradar OK, I have added a counter to assert that it is called at least once. PTAL
[I] Issue with reading the debezium inputs [hudi]
zyperd opened a new issue, #10561: URL: https://github.com/apache/hudi/issues/10561 When hudi is reading the debezium ingested topics the following error message is displayed, kindly help to identify the issue ``` Caused by: java.lang.NoSuchMethodException: org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource.(org.apache.hudi.common.config.TypedProperties,org.apache.spark.api.java.JavaSparkContext,org.apache.spark.sql.SparkSession,org.apache.hudi.utilities.schema.SchemaProvider)``` In the source public MysqlDebeziumSource(TypedProperties props, JavaSparkContext sparkContext, SparkSession sparkSession, SchemaProvider schemaProvider, HoodieIngestionMetrics metrics) Is the spark-submit command missing any hudi config? hudi-aws-bundle.jar -> hudi-utilities-bundle_2.12-0.14.0-amzn-1.jar ``` ``` spark-submit \ --master yarn \ --deploy-mode cluster \ --driver-memory 2g --executor-memory 1g --num-executors 1 --executor-cores 1 \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.sql.catalogImplementation=hive \ --conf spark.driver.maxResultSize=1g \ --conf spark.speculation=true \ --conf spark.speculation.multiplier=1.0 \ --conf spark.speculation.quantile=0.5 \ --conf spark.ui.port=6680 \ --conf spark.eventLog.dir=s3://spark_events/ \ --conf spark.eventLog.enabled=true \ --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \ --conf spark.scheduler.mode=FAIR \ --jars /usr/lib/hudi/hudi-aws-bundle.jar,/home/hadoop/kafka-avro-serializer-3.1.1.jar \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \ --target-base-path s3://mysql_cdc/table_cdc/ \ --source-class org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource \ --payload-class org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload \ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ --source-ordering-field id \ --target-table table_cdc \ 
--table-type COPY_ON_WRITE \ --op UPSERT \ --enable-hive-sync \ --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \ --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \ --hoodie-conf auto.offset.reset=earliest \ --hoodie-conf bootstrap.servers=127.0.0.1:9002 \ --hoodie-conf hoodie.deltastreamer.source.kafka.topic="table_cdc" \ --hoodie-conf hoodie.deltastreamer.source.kafka.value.deserializer.class=org.apache.hudi.utilities.deser.KafkaAvroSchemaDeserializer \ --hoodie-conf hoodie.datasource.hive_sync.enable=true \ --hoodie-conf hoodie.datasource.hive_sync.database=default \ --hoodie-conf hoodie.datasource.hive_sync.table=table_cdc \ --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \ --hoodie-conf hoodie.datasource.write.recordkey.field=id \ --hoodie-conf hoodie.datasource.write.partitionpath.field=value_type \ --hoodie-conf hoodie.compaction.payload.class=org.apache.hudi.common.model.DebeziumAvroPayload \ --hoodie-conf hoodie.table.name=table_cdc \ --hoodie-conf hoodie.streamer.schemaprovider.source.schema.file=file:///source.avsc \ --hoodie-conf hoodie.streamer.schemaprovider.target.schema.file=file:///target.avsc \ --hoodie-conf hoodie.datasource.hive_sync.partition_fields=value_type \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://cdc-events/ \ --hoodie-conf hoodie.datasource.hive_sync.mode=hms ``` #source.avsc ``` { "type": "record", "name": "ChangeEvent", "fields": [ { "name": "before", "type": ["null", "string"] }, { "name": "after", "type": { "type": "record", "name": "After", "fields": [ { "name": "id", "type": ["int"] }, { "name": "values", "type": "string" }, { "name": "value_type", "type": "string" }, ] } }, { "name": "source", "type": { "type": "record", "name": "Source", "fields": [ { "name": "version", "type": ["null", "string"] }, { "name": "connector", "type": ["null", "string"] }, 
{ "name": "name", "type": ["null", "string"] }, { "name": "ts_ms", "type": ["null", "long"] }, { "name": "snapshot", "t
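One plausible reading of the `NoSuchMethodException` in the issue above is that the code doing the reflective lookup asks for a four-argument constructor, while the `MysqlDebeziumSource` on the classpath only declares the five-argument one that also takes `HoodieIngestionMetrics`. That can happen when jars from different Hudi builds are mixed, but this is an assumption and not a confirmed diagnosis. The sketch below only reproduces the mechanism with hypothetical stand-in classes (`Props`, `Metrics`, `NewStyleSource`); it does not use any Hudi API.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-ins; none of these are Hudi classes.
class ConstructorLookupSketch {

  static class Props {}

  static class Metrics {}

  // Plays the role of a source class whose only public constructor
  // takes an extra trailing argument.
  static class NewStyleSource {
    public NewStyleSource(Props props, Metrics metrics) {}
  }

  // Returns true if clazz declares a public constructor with exactly these
  // parameter types. Class#getConstructor matches the full signature, so a
  // shorter parameter list does not match a longer constructor.
  static boolean hasConstructor(Class<?> clazz, Class<?>... paramTypes) {
    try {
      Constructor<?> ctor = clazz.getConstructor(paramTypes);
      return ctor != null;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Old-style lookup (one argument fewer): not found.
    System.out.println(hasConstructor(NewStyleSource.class, Props.class));
    // Lookup with the full signature: found.
    System.out.println(hasConstructor(NewStyleSource.class, Props.class, Metrics.class));
  }
}
```

If this is what is happening, checking that the jar providing `HoodieDeltaStreamer` and the jar providing the source class come from the same Hudi version would be a reasonable first step.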
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
yihua merged PR #10549: URL: https://github.com/apache/hudi/pull/10549
(hudi) branch master updated (11861c8a50e -> f2b24a149c1)
This is an automated email from the ASF dual-hosted git repository.
yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git
from 11861c8a50e [HUDI-7298] Write bad records to error table in more cases instead of failing stream (#10500)
add f2b24a149c1 [HUDI-7323] Use a schema supplier instead of a static value (#10549)
No new revisions were added by this update.
Summary of changes:
.../org/apache/hudi/utilities/UtilHelpers.java | 7 +++---
.../apache/hudi/utilities/streamer/StreamSync.java | 15 +--
.../utilities/transform/ChainedTransformer.java| 12 +
.../ErrorTableAwareChainedTransformer.java | 5 ++--
.../functional/TestChainedTransformer.java | 29 +++---
.../TestErrorTableAwareChainedTransformer.java | 4 +--
6 files changed, 48 insertions(+), 24 deletions(-)
Re: [I] [SUPPORT] After upgrading hudi 0.14.1, use Spark SQL merge into to update the matched_action, the case of the column name and the expression name does not match, resulting in an exception. [hu
yihao-tcf commented on issue #10558: URL: https://github.com/apache/hudi/issues/10558#issuecomment-1909366472 > @yihao-tcf @jonvex hi, any plan fix it? If not, I can try to fix it @KnightChess I don't have any plans here. Thank you for fixing this issue
Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]
nsivabalan commented on code in PR #10238: URL: https://github.com/apache/hudi/pull/10238#discussion_r1465848613 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -652,7 +653,7 @@ public static HoodieData convertMetadataToColumnStatsRecords(Hoodi String partitionPath = deleteFileInfoPair.getLeft(); String filePath = deleteFileInfoPair.getRight(); - if (filePath.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) { + if (filePath.endsWith(HoodieFileFormat.PARQUET.getFileExtension()) || ExternalFilePathUtil.isExternallyCreatedFile(filePath)) { Review Comment: I guess there is a gap here with respect to log files. For log files, we get stats directly from the Append Handle, and entries are added to col stats using the log file name. So, during clean commit metadata, we should be deleting both data files and log files. https://issues.apache.org/jira/browse/HUDI-7331 I have created a follow-up ticket for this. Let's fix it thoroughly.
[jira] [Created] (HUDI-7331) Test and certify col stats integration with MOR table
sivabalan narayanan created HUDI-7331:
Summary: Test and certify col stats integration with MOR table
Key: HUDI-7331
URL: https://issues.apache.org/jira/browse/HUDI-7331
Project: Apache Hudi
Issue Type: Bug
Components: metadata
Reporter: sivabalan narayanan
Let's test and certify col stats integration with MOR tables for all operations. For example, any write operation (bulk insert, insert, upsert, insert overwrite) should add new entries to the col stats index in the metadata table.
rollback: files that were deleted should be removed from col stats (data files). For log files added, we should add new entries to col stats.
clean: any files deleted (data files and log files) should have their entries removed from col stats in the MDT.
Similarly, let's do the same exercise with delete partition and the other operations we have in Hudi.
-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1465827024 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -0,0 +1,279 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.table.read; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieReaderContext; +import org.apache.hudi.common.model.DeleteRecord; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.log.KeySpec; +import org.apache.hudi.common.table.log.block.HoodieDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieCorruptedDataException; +import org.apache.hudi.exception.HoodieKeyException; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.avro.Schema; +import org.roaringbitmap.longlong.Roaring64NavigableMap; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Set; + +public abstract class HoodieBaseFileGroupRecordBuffer implements HoodieFileGroupRecordBuffer { + protected final HoodieReaderContext readerContext; + protected final Schema readerSchema; + protected final Schema baseFileSchema; + protected final Option partitionNameOverrideOpt; + protected final Option partitionPathFieldOpt; + protected final HoodieRecordMerger recordMerger; + protected final TypedProperties payloadProps; + protected final HoodieTableMetaClient hoodieTableMetaClient; + protected final Map, Map>> records; + protected ClosableIterator baseFileIterator; + protected Iterator, Map>> logRecordIterator; + protected T nextRecord; + + public HoodieBaseFileGroupRecordBuffer(HoodieReaderContext readerContext, + Schema readerSchema, + Schema 
baseFileSchema, + Option partitionNameOverrideOpt, + Option partitionPathFieldOpt, + HoodieRecordMerger recordMerger, + TypedProperties payloadProps, + HoodieTableMetaClient hoodieTableMetaClient) { +this.readerContext = readerContext; +this.readerSchema = readerSchema; +this.baseFileSchema = baseFileSchema; +this.partitionNameOverrideOpt = partitionNameOverrideOpt; +this.partitionPathFieldOpt = partitionPathFieldOpt; +this.recordMerger = recordMerger; +this.payloadProps = payloadProps; +this.hoodieTableMetaClient = hoodieTableMetaClient; +this.records = new HashMap<>(); Review Comment: The sequence of the log records is lost by using `HashMap#values`; we should fix it. In general, should we cache all the log records in memory? I don't think that is reasonable; we should use a spillable map here. We also need to support an unmerged log reader for streaming read scenarios; in that case, we should not buffer the log records at all. The log read sequence should be ensured too.
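The ordering concern in the review comment above comes from the fact that `java.util.HashMap` makes no guarantee about iteration order, so `values()` need not return records in the order they were read. A minimal sketch of one way to keep arrival order is shown below, using `java.util.LinkedHashMap`. This is only an illustration of the ordering point; it is not the fix adopted in the PR, it does not address the memory concern (the reviewer suggests a spillable map for that), and `LogRecordOrderSketch` and `bufferInArrivalOrder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; records are modelled as {key, value} string pairs.
class LogRecordOrderSketch {

  // Buffers records keyed by record key and returns the buffered values.
  // LinkedHashMap iterates in insertion order, and re-inserting an existing
  // key replaces the value without changing that key's position.
  static List<String> bufferInArrivalOrder(List<String[]> keyedRecords) {
    Map<String, String> buffer = new LinkedHashMap<>();
    for (String[] kv : keyedRecords) {
      buffer.put(kv[0], kv[1]);
    }
    return new ArrayList<>(buffer.values());
  }

  public static void main(String[] args) {
    List<String[]> records = Arrays.asList(
        new String[] {"k3", "v3"},
        new String[] {"k1", "v1"},
        new String[] {"k2", "v2"},
        new String[] {"k3", "v3-updated"});
    System.out.println(bufferInArrivalOrder(records));
  }
}
```

Whether first-seen position or last-seen position is the right semantics for an updated key depends on the merge logic, which this sketch does not model.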
[PR] new hudi content for 01-2024 [hudi]
nfarah86 opened a new pull request, #10560: URL: https://github.com/apache/hudi/pull/10560 new pr content for hudi blogs cc @bhasudha https://github.com/apache/hudi/assets/5392555/00daee1b-bb2a-4850-b155-619a3c2a3383
Re: [PR] [HUDI-6786] HoodieFileGroupReader integration [hudi]
danny0405 commented on code in PR #9819: URL: https://github.com/apache/hudi/pull/9819#discussion_r1465827024 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -0,0 +1,279 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.common.table.read; + +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.engine.HoodieReaderContext; +import org.apache.hudi.common.model.DeleteRecord; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordMerger; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.log.KeySpec; +import org.apache.hudi.common.table.log.block.HoodieDataBlock; +import org.apache.hudi.common.table.log.block.HoodieLogBlock; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.ReflectionUtils; +import org.apache.hudi.common.util.collection.ClosableIterator; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieCorruptedDataException; +import org.apache.hudi.exception.HoodieKeyException; +import org.apache.hudi.exception.HoodieValidationException; + +import org.apache.avro.Schema; +import org.roaringbitmap.longlong.Roaring64NavigableMap; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Set; + +public abstract class HoodieBaseFileGroupRecordBuffer implements HoodieFileGroupRecordBuffer { + protected final HoodieReaderContext readerContext; + protected final Schema readerSchema; + protected final Schema baseFileSchema; + protected final Option partitionNameOverrideOpt; + protected final Option partitionPathFieldOpt; + protected final HoodieRecordMerger recordMerger; + protected final TypedProperties payloadProps; + protected final HoodieTableMetaClient hoodieTableMetaClient; + protected final Map, Map>> records; + protected ClosableIterator baseFileIterator; + protected Iterator, Map>> logRecordIterator; + protected T nextRecord; + + public HoodieBaseFileGroupRecordBuffer(HoodieReaderContext readerContext, + Schema readerSchema, + Schema 
baseFileSchema, + Option partitionNameOverrideOpt, + Option partitionPathFieldOpt, + HoodieRecordMerger recordMerger, + TypedProperties payloadProps, + HoodieTableMetaClient hoodieTableMetaClient) { +this.readerContext = readerContext; +this.readerSchema = readerSchema; +this.baseFileSchema = baseFileSchema; +this.partitionNameOverrideOpt = partitionNameOverrideOpt; +this.partitionPathFieldOpt = partitionPathFieldOpt; +this.recordMerger = recordMerger; +this.payloadProps = payloadProps; +this.hoodieTableMetaClient = hoodieTableMetaClient; +this.records = new HashMap<>(); Review Comment: The sequence of the log records is lost by using `HashMap#values`; we should fix it. In general, should we cache all the log records in memory? I don't think that is reasonable: when the base file is empty, it is feasible to just keep an iterator over the log files.
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909332258 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN * 74b8a6658f324313bec3525aae40a3203a8c6bc1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22154)
[jira] [Closed] (HUDI-7215) Delete NewHoodieParquetFileFormat and all references
[ https://issues.apache.org/jira/browse/HUDI-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-7215.
Resolution: Fixed
> Delete NewHoodieParquetFileFormat and all references
> Key: HUDI-7215
> URL: https://issues.apache.org/jira/browse/HUDI-7215
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> HoodieFileGroupReaderBasedParquetFileFormat now has feature parity with NewHoodieParquetFileFormat and no new work will be done on NewHoodieParquetFileFormat.
[jira] [Closed] (HUDI-7244) Ensure ClosableIterator is propagated all the way to FileScanRDD
[ https://issues.apache.org/jira/browse/HUDI-7244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-7244.
Resolution: Fixed
> Ensure ClosableIterator is propagated all the way to FileScanRDD
> Key: HUDI-7244
> URL: https://issues.apache.org/jira/browse/HUDI-7244
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, spark-sql
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Critical
> Labels: pull-request-available
>
> CI tests are OOMing. One cause is that resources are not being freed from the new filegroup reader. After some code inspection, it was found that close is not being called in the HoodieFileGroupReaderIterator
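The failure mode described in HUDI-7244 (a wrapping iterator that never calls close on the reader underneath it) can be illustrated with the sketch below. All names here (`CloseableIter`, `TrackingSource`, `map`, `sumDoubled`) are hypothetical; Hudi's `ClosableIterator` and the Spark `FileScanRDD` wiring are not reproduced, and this is not the change that resolved the ticket.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of close propagation through a wrapping iterator.
class ClosePropagationSketch {

  // Minimal closeable iterator; close() is narrowed to throw no checked exception.
  interface CloseableIter<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close();
  }

  // A source that records whether it was closed, standing in for a reader
  // that holds file handles or buffers.
  static class TrackingSource implements CloseableIter<Integer> {
    private final Iterator<Integer> inner;
    boolean closed = false;

    TrackingSource(List<Integer> values) {
      this.inner = values.iterator();
    }

    @Override public boolean hasNext() { return inner.hasNext(); }
    @Override public Integer next() { return inner.next(); }
    @Override public void close() { closed = true; }
  }

  // Wraps a source with a mapping function. The wrapper forwards close() to the
  // source; dropping that one line is the kind of leak the ticket describes.
  static <A, B> CloseableIter<B> map(CloseableIter<A> source, Function<A, B> fn) {
    return new CloseableIter<B>() {
      @Override public boolean hasNext() { return source.hasNext(); }
      @Override public B next() { return fn.apply(source.next()); }
      @Override public void close() { source.close(); }
    };
  }

  // Consumes the wrapped iterator; try-with-resources closes the wrapper,
  // which in turn closes the source.
  static int sumDoubled(TrackingSource source) {
    int sum = 0;
    try (CloseableIter<Integer> it = map(source, x -> x * 2)) {
      while (it.hasNext()) {
        sum += it.next();
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    TrackingSource source = new TrackingSource(Arrays.asList(1, 2, 3));
    System.out.println(sumDoubled(source) + " closed=" + source.closed);
  }
}
```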
[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution
[ https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler updated HUDI-7045:
Status: In Progress (was: Open)
> Fix new file format and reader for schema evolution
> Key: HUDI-7045
> URL: https://issues.apache.org/jira/browse/HUDI-7045
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> When this is implemented, parquet readers should not be created in HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can uncomment/add the code from this commit: [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6]
[jira] [Closed] (HUDI-7296) Reduce combinations for some tests to make ci faster
[ https://issues.apache.org/jira/browse/HUDI-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-7296.
Resolution: Fixed
> Reduce combinations for some tests to make ci faster
> Key: HUDI-7296
> URL: https://issues.apache.org/jira/browse/HUDI-7296
> Project: Apache Hudi
> Issue Type: Test
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many combinations of params. While it is good to test everything, there are lots of code paths that have extensive duplicate testing. Reduce the number of tests while still maintaining code coverage
[jira] [Updated] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler updated HUDI-6787:
Status: Patch Available (was: In Progress)
> Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Jonathan Vexler
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Closed] (HUDI-6872) Simplify Out Of Box Schema Evolution Functionality
[ https://issues.apache.org/jira/browse/HUDI-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler closed HUDI-6872. - Resolution: Fixed > Simplify Out Of Box Schema Evolution Functionality > -- > > Key: HUDI-6872 > URL: https://issues.apache.org/jira/browse/HUDI-6872 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer, spark, spark-sql >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Test schema evolution capabilities out of the box for deltastreamer and > datasource. Make schema evolution out of the box easy to understand and use
[jira] [Updated] (HUDI-7327) hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work with HoodieIncrSource unless meta cols are dropped
[ https://issues.apache.org/jira/browse/HUDI-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7327: -- Status: Patch Available (was: In Progress) > hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work > with HoodieIncrSource unless meta cols are dropped > -- > > Key: HUDI-7327 > URL: https://issues.apache.org/jira/browse/HUDI-7327 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > The incoming meta cols are treated as new columns which is not allowed by > internalschema so it fails
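A minimal sketch of the idea behind this issue, assuming the incoming field names are available as a plain list: the Hudi meta columns are filtered out before the incoming schema is compared with the table schema, so they are not mistaken for newly added columns. `MetaColumnFilterSketch` and `dropMetaColumns` are hypothetical names used only for illustration and are not part of Hudi; the five names listed are the standard Hudi meta fields.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not a Hudi class.
final class MetaColumnFilterSketch {
    // The five standard Hudi meta columns. An explicit set is used instead of a
    // "_hoodie_" prefix test so that user-visible fields with that prefix survive.
    private static final Set<String> META_COLUMNS = new HashSet<>(Arrays.asList(
        "_hoodie_commit_time",
        "_hoodie_commit_seqno",
        "_hoodie_record_key",
        "_hoodie_partition_path",
        "_hoodie_file_name"));

    private MetaColumnFilterSketch() {
    }

    // Returns the field names in their original order, without the meta columns.
    static List<String> dropMetaColumns(List<String> fieldNames) {
        return fieldNames.stream()
            .filter(name -> !META_COLUMNS.contains(name))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("_hoodie_commit_time", "_hoodie_record_key", "id", "ts");
        System.out.println(dropMetaColumns(incoming)); // prints [id, ts]
    }
}
```

The sketch works on names only; a real change would have to apply the same filter to the schema object that is handed to the schema-evolution check.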
[jira] [Closed] (HUDI-7298) Write bad records to error table in more cases instead of failing stream
[ https://issues.apache.org/jira/browse/HUDI-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler closed HUDI-7298. - Resolution: Fixed > Write bad records to error table in more cases instead of failing stream > > > Key: HUDI-7298 > URL: https://issues.apache.org/jira/browse/HUDI-7298 > Project: Apache Hudi > Issue Type: Improvement > Components: deltastreamer, spark >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Critical > Labels: pull-request-available > > If no transformer is used, but schema provider is used, records with the > incorrect schema will not be detected and will fail the stream during > HoodieRecord creation. Additionally, during keygeneration the stream can > crash if required fields are null.
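The behaviour this issue asks for (route a bad record to an error sink instead of letting one exception fail the whole stream) can be sketched in plain Java. This is only an illustration under simplifying assumptions: it uses in-memory lists instead of Spark RDDs, and `ErrorRoutingSketch`, `Split` and `convertAll` are hypothetical names, not Hudi APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration; not part of Hudi.
final class ErrorRoutingSketch {
    // Holds successfully converted records and the string form of the failed ones.
    static final class Split<T> {
        final List<T> good = new ArrayList<>();
        final List<String> bad = new ArrayList<>();
    }

    private ErrorRoutingSketch() {
    }

    // Applies convert to every record. A record whose conversion throws is
    // captured as a string for the error sink instead of aborting the batch.
    static <R, T> Split<T> convertAll(List<R> records, Function<R, T> convert) {
        Split<T> out = new Split<>();
        for (R record : records) {
            try {
                out.good.add(convert.apply(record));
            } catch (RuntimeException e) {
                out.bad.add(String.valueOf(record));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Split<Integer> split = convertAll(Arrays.asList("1", "two", "3"), Integer::parseInt);
        System.out.println(split.good); // prints [1, 3]
        System.out.println(split.bad);  // prints [two]
    }
}
```

Keeping the failed input as a string mirrors the intent of an error table: the offending record stays inspectable after the batch has completed.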
[jira] [Updated] (HUDI-7327) hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work with HoodieIncrSource unless meta cols are dropped
[ https://issues.apache.org/jira/browse/HUDI-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7327: -- Status: In Progress (was: Open) > hoodie.write.handle.missing.cols.with.lossless.type.promotion does not work > with HoodieIncrSource unless meta cols are dropped > -- > > Key: HUDI-7327 > URL: https://issues.apache.org/jira/browse/HUDI-7327 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > The incoming meta cols are treated as new columns which is not allowed by > internalschema so it fails
(hudi) branch master updated: [HUDI-7298] Write bad records to error table in more cases instead of failing stream (#10500)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 11861c8a50e [HUDI-7298] Write bad records to error table in more cases instead of failing stream (#10500)
11861c8a50e is described below

commit 11861c8a50e7dd23186d44bdc7aef871e5fc1280
Author: Jon Vexler
AuthorDate: Wed Jan 24 22:59:29 2024 -0500

    [HUDI-7298] Write bad records to error table in more cases instead of failing stream (#10500)

    Cases:
    - No transformers, with schema provider. Records will go to the error table if they cannot be rewritten in the deduced schema.
    - recordkey is null, even if the column is nullable in the schema
---
 .../apache/hudi/config/HoodieErrorTableConfig.java |   6 ++
 .../scala/org/apache/hudi/HoodieSparkUtils.scala   |  21 +
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  33 ++-
 .../org/apache/hudi/TestHoodieSparkUtils.scala     |   4 +
 .../apache/hudi/utilities/streamer/ErrorEvent.java |   6 +-
 .../utilities/streamer/HoodieStreamerUtils.java    |  68 ++
 .../apache/hudi/utilities/streamer/StreamSync.java |  19 +++-
 ...TestHoodieDeltaStreamerSchemaEvolutionBase.java |  63 +
 ...oodieDeltaStreamerSchemaEvolutionExtensive.java | 100 +++--
 ...estHoodieDeltaStreamerSchemaEvolutionQuick.java |  18 ++--
 .../utilities/sources/TestGenericRddTransform.java |  29 ++
 .../schema-evolution/testMissingRecordKey.json     |   2 +
 12 files changed, 334 insertions(+), 35 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
index 68e2097c33b..8ba013b00ee 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieErrorTableConfig.java
@@ -72,6 +72,12 @@ public class HoodieErrorTableConfig {
       .defaultValue(false)
       .withDocumentation("Records with schema mismatch with Target Schema are sent to Error Table.");
 
+  public static final ConfigProperty ERROR_ENABLE_VALIDATE_RECORD_CREATION = ConfigProperty
+      .key("hoodie.errortable.validate.recordcreation.enable")
+      .defaultValue(true)
+      .sinceVersion("0.14.2")
+      .withDocumentation("Records that fail to be created due to keygeneration failure or other issues will be sent to the Error Table");
+
   public static final ConfigProperty ERROR_TABLE_WRITE_FAILURE_STRATEGY = ConfigProperty
       .key("hoodie.errortable.write.failure.strategy")
       .defaultValue(ErrorWriteFailureStrategy.ROLLBACK_COMMIT.name())
diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
index 527864fcf24..535af8db193 100644
--- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
+++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
@@ -199,6 +199,27 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi
     }
   }
 
+  /**
+   * Rerwite the record into the target schema.
+   * Return tuple of rewritten records and records that could not be converted
+   */
+  def safeRewriteRDD(df: RDD[GenericRecord], serializedTargetSchema: String): Tuple2[RDD[GenericRecord], RDD[String]] = {
+    val rdds: RDD[Either[GenericRecord, String]] = df.mapPartitions { recs =>
+      if (recs.isEmpty) {
+        Iterator.empty
+      } else {
+        val schema = new Schema.Parser().parse(serializedTargetSchema)
+        val transform: GenericRecord => Either[GenericRecord, String] = record => try {
+          Left(HoodieAvroUtils.rewriteRecordDeep(record, schema, true))
+        } catch {
+          case _: Throwable => Right(HoodieAvroUtils.avroToJsonString(record, false))
+        }
+        recs.map(transform)
+      }
+    }
+    (rdds.filter(_.isLeft).map(_.left.get), rdds.filter(_.isRight).map(_.right.get))
+  }
+
   def getCatalystRowSerDe(structType: StructType): SparkRowSerDe = {
     sparkAdapter.createSparkRowSerDe(structType)
   }
diff --git a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
index ac7dcd42979..9b925eb59be 100644
--- a/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
@@ -187,6 +187,16 @@ public class HoodieAvroUtils {
     }
   }
 
+  /**
+   * Convert a given avro record to json and return the string
+   *
+   * @param record The GenericRecord to convert
+   * @param pretty Whether to
Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]
codope merged PR #10500: URL: https://github.com/apache/hudi/pull/10500
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909294966 ## CI report: * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145) * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22155) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909290294 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 7476839c8fde914ff1e201af11f591f46fec392e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153) * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN * 74b8a6658f324313bec3525aae40a3203a8c6bc1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22154)
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1909290227 ## CI report: * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145) * 9c2e36ff019825e1b3e208e7a8ae0d0252029ea3 UNKNOWN
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909284023 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 7476839c8fde914ff1e201af11f591f46fec392e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153) * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN * 74b8a6658f324313bec3525aae40a3203a8c6bc1 UNKNOWN
Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]
hudi-bot commented on PR #10223: URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909283810 ## CI report: * d197ce8180f3f11e30d2254733c46f137e12376c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22152)
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
danny0405 commented on code in PR #10344: URL: https://github.com/apache/hudi/pull/10344#discussion_r1465787117 ## hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java: ## @@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, DiskMapType.BITCASK); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, diskMapType, false); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType, boolean isCompressionEnabled) throws IOException { this.inMemoryMap = new HashMap<>(); this.baseFilePath = baseFilePath; -this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * sizingFactorForInMemoryMap); +this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * SIZING_FACTOR_FOR_IN_MEMORY_MAP); this.currentInMemoryMapSize = 0L; this.keySizeEstimator = keySizeEstimator; this.valueSizeEstimator = valueSizeEstimator; this.diskMapType = diskMapType; this.isCompressionEnabled = isCompressionEnabled; } + private DiskMap getDiskBasedMap() { +return getDiskBasedMap(false); + } + + private DiskMap getOrCreateDiskBasedMap() { +return getDiskBasedMap(true); + } + private DiskMap getDiskBasedMap(boolean 
forceInitialization) { if (null == diskBasedMap) { - if (!forceInitialization) { -return DiskMap.empty(); - } synchronized (this) { if (null == diskBasedMap) { + if (!forceInitialization) { +return DiskMap.empty(); Review Comment: > We can avoid the dummy empty map by also embedding null checks into all of the methods Somehow makes sense, I just thought it might be more straight-forward to do that in specific map impls.
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
the-other-tim-brown commented on code in PR #10344: URL: https://github.com/apache/hudi/pull/10344#discussion_r1465785310 ## hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java: ## @@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, DiskMapType.BITCASK); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, diskMapType, false); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType, boolean isCompressionEnabled) throws IOException { this.inMemoryMap = new HashMap<>(); this.baseFilePath = baseFilePath; -this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * sizingFactorForInMemoryMap); +this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * SIZING_FACTOR_FOR_IN_MEMORY_MAP); this.currentInMemoryMapSize = 0L; this.keySizeEstimator = keySizeEstimator; this.valueSizeEstimator = valueSizeEstimator; this.diskMapType = diskMapType; this.isCompressionEnabled = isCompressionEnabled; } + private DiskMap getDiskBasedMap() { +return getDiskBasedMap(false); + } + + private DiskMap getOrCreateDiskBasedMap() { +return getDiskBasedMap(true); + } + private DiskMap getDiskBasedMap(boolean 
forceInitialization) { if (null == diskBasedMap) { - if (!forceInitialization) { -return DiskMap.empty(); - } synchronized (this) { if (null == diskBasedMap) { + if (!forceInitialization) { +return DiskMap.empty(); Review Comment: At some point you will need to pay special attention to whether the read/write methods are correct. Right now we are just debating about where that is, is that correct? In my opinion, the ExternalSpillableMap is the logical place for handling the logic of initializing the disk map if it is needed. We can avoid the dummy empty map by also embedding null checks into all of the methods.
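The null-check variant proposed in this comment could look roughly like the sketch below. It is not the PR's code: `LazySpillSketch` is a hypothetical stand-in in which a plain `HashMap` plays the role of the disk map and an entry count replaces the byte-size estimate. The disk-side map stays `null` until the first spill, reads tolerate `null`, and creation uses double-checked locking on a `volatile` field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the actual ExternalSpillableMap.
final class LazySpillSketch {
    private final Map<String, String> inMemory = new HashMap<>();
    private final int maxInMemoryEntries;
    // Stand-in for the disk map. volatile is needed for safe double-checked locking.
    private volatile Map<String, String> disk;

    LazySpillSketch(int maxInMemoryEntries) {
        this.maxInMemoryEntries = maxInMemoryEntries;
    }

    // Read path: a null check replaces the dummy empty map.
    String get(String key) {
        String value = inMemory.get(key);
        if (value != null) {
            return value;
        }
        Map<String, String> d = disk;
        return d == null ? null : d.get(key);
    }

    // Write path: only this path may create the disk-side map.
    void put(String key, String value) {
        Map<String, String> d = disk;
        if (d != null && d.containsKey(key)) {
            d.put(key, value); // key already spilled, update it in place
        } else if (inMemory.containsKey(key) || inMemory.size() < maxInMemoryEntries) {
            inMemory.put(key, value);
        } else {
            getOrCreateDisk().put(key, value);
        }
    }

    boolean hasSpilled() {
        return disk != null;
    }

    private Map<String, String> getOrCreateDisk() {
        Map<String, String> d = disk;
        if (d == null) {
            synchronized (this) {
                d = disk;
                if (d == null) {
                    d = new HashMap<>();
                    disk = d;
                }
            }
        }
        return d;
    }

    public static void main(String[] args) {
        LazySpillSketch map = new LazySpillSketch(1);
        map.put("a", "1");
        map.put("b", "2"); // memory is full, so this entry spills
        System.out.println(map.get("b"));     // prints 2
        System.out.println(map.hasSpilled()); // prints true
    }
}
```

Only the lazy creation is guarded here; the in-memory `HashMap` itself is not thread-safe, so the sketch shows the initialization pattern rather than a fully concurrent map.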
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
danny0405 commented on code in PR #10344: URL: https://github.com/apache/hudi/pull/10344#discussion_r1465781096 ## hudi-common/src/main/java/org/apache/hudi/common/util/collection/ExternalSpillableMap.java: ## @@ -78,41 +78,49 @@ public class ExternalSpillableMap keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, DiskMapType.BITCASK); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType) throws IOException { this(maxInMemorySizeInBytes, baseFilePath, keySizeEstimator, valueSizeEstimator, diskMapType, false); } - public ExternalSpillableMap(Long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, + public ExternalSpillableMap(long maxInMemorySizeInBytes, String baseFilePath, SizeEstimator keySizeEstimator, SizeEstimator valueSizeEstimator, DiskMapType diskMapType, boolean isCompressionEnabled) throws IOException { this.inMemoryMap = new HashMap<>(); this.baseFilePath = baseFilePath; -this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * sizingFactorForInMemoryMap); +this.maxInMemorySizeInBytes = (long) Math.floor(maxInMemorySizeInBytes * SIZING_FACTOR_FOR_IN_MEMORY_MAP); this.currentInMemoryMapSize = 0L; this.keySizeEstimator = keySizeEstimator; this.valueSizeEstimator = valueSizeEstimator; this.diskMapType = diskMapType; this.isCompressionEnabled = isCompressionEnabled; } + private DiskMap getDiskBasedMap() { +return getDiskBasedMap(false); + } + + private DiskMap getOrCreateDiskBasedMap() { +return getDiskBasedMap(true); + } + private DiskMap getDiskBasedMap(boolean 
forceInitialization) { if (null == diskBasedMap) { - if (!forceInitialization) { -return DiskMap.empty(); - } synchronized (this) { if (null == diskBasedMap) { + if (!forceInitialization) { +return DiskMap.empty(); Review Comment: I'm wondering if we can make the two map implementations initialize lazily by themselves, so that in this ExternalSpillableMap there is no need to pay special attention to keeping the read/write methods correct, and there is no need to introduce the dummy empty map.
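The alternative raised in this comment, where each disk map implementation defers its own setup, might be sketched as follows. `SelfInitializingDiskMapSketch` is a hypothetical class, and the `Supplier` stands in for whatever opens the real backing store; neither is taken from the PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration; not one of Hudi's DiskMap implementations.
final class SelfInitializingDiskMapSketch {
    private final Supplier<Map<String, String>> opener;
    private Map<String, String> store; // stays null until the first write

    SelfInitializingDiskMapSketch(Supplier<Map<String, String>> opener) {
        this.opener = opener;
    }

    // Reads never open the store; an unopened map simply behaves as empty.
    synchronized String get(String key) {
        return store == null ? null : store.get(key);
    }

    synchronized int size() {
        return store == null ? 0 : store.size();
    }

    // The first write opens the backing store.
    synchronized void put(String key, String value) {
        if (store == null) {
            store = opener.get();
        }
        store.put(key, value);
    }

    synchronized boolean isOpen() {
        return store != null;
    }

    public static void main(String[] args) {
        SelfInitializingDiskMapSketch map = new SelfInitializingDiskMapSketch(HashMap::new);
        System.out.println(map.size());   // prints 0
        System.out.println(map.isOpen()); // prints false
        map.put("k", "v");
        System.out.println(map.get("k")); // prints v
    }
}
```

With this shape the caller can always hold a non-null map and needs neither a placeholder nor null checks; the cost is that every implementation has to repeat the deferral logic.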
Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]
jonvex commented on PR #10500: URL: https://github.com/apache/hudi/pull/10500#issuecomment-1909253547 azure ci passing
Re: [PR] [HUDI-7218] Integrate new HFile reader with file reader factory [hudi]
vinothchandar commented on code in PR #10330: URL: https://github.com/apache/hudi/pull/10330#discussion_r1459536068 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -182,7 +183,7 @@ public static List> filterKeysFromFile(Path filePath, List> foundRecordKeys = new ArrayList<>(); try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO) -.getFileReader(configuration, filePath)) { +.getFileReader(new HoodieConfig(), configuration, filePath)) { Review Comment: this feels a little odd to be passing in an empty properties list
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909247856 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 7476839c8fde914ff1e201af11f591f46fec392e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153) * fd05a7d87c676275e5f5e329e0207cc97ec9adfb UNKNOWN
Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]
hudi-bot commented on PR #10556: URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909242727 ## CI report: * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22151)
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909198297 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 7476839c8fde914ff1e201af11f591f46fec392e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909164123 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149) * 7476839c8fde914ff1e201af11f591f46fec392e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22153)
Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]
hudi-bot commented on PR #10223: URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909157450 ## CI report: * 157fb0e8df7b87579ca64a2d3a64212675baf644 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21923) * d197ce8180f3f11e30d2254733c46f137e12376c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22152)
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
hudi-bot commented on PR #10549: URL: https://github.com/apache/hudi/pull/10549#issuecomment-1909151475 ## CI report: * ca627db36503a81c4223edde799bd344b9cf2b05 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22148)
Re: [PR] [] CVE-2023-44487 Upgrade jetty and exclude older jetty [hudi]
hudi-bot commented on PR #10223: URL: https://github.com/apache/hudi/pull/10223#issuecomment-1909150913 ## CI report: * 157fb0e8df7b87579ca64a2d3a64212675baf644 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21923) * d197ce8180f3f11e30d2254733c46f137e12376c UNKNOWN
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909143577 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149) * 7476839c8fde914ff1e201af11f591f46fec392e UNKNOWN
Re: [I] [SUPPORT] Hudi DeltaStreamer with Flattening Transformer [hudi]
soumilshah1995 closed issue #10499: [SUPPORT] Hudi DeltaStreamer with Flattening Transformer URL: https://github.com/apache/hudi/issues/10499
Re: [I] [SUPPORT] Hudi DeltaStreamer with Flattening Transformer [hudi]
soumilshah1995 commented on issue #10499: URL: https://github.com/apache/hudi/issues/10499#issuecomment-1909122022 I need some time to experiment with the flattening transformer and to set up a test project to see whether it works. Let me close this for now and reopen it later, as I will most likely run these tests next week.
Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]
hudi-bot commented on PR #10556: URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909106935 ## CI report: * fd66a2b6c21e32cc340e3a813acf826dc83b3547 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22131) * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22151) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]
hudi-bot commented on PR #10556: URL: https://github.com/apache/hudi/pull/10556#issuecomment-1909099807 ## CI report: * fd66a2b6c21e32cc340e3a813acf826dc83b3547 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22131) * 6fb0ee2e5f0edcdf7657269973eb0968d0d7b0fa UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909089865 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated (77833cdb096 -> a83f7c03836)
This is an automated email from the ASF dual-hosted git repository.
vbalaji pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git
from 77833cdb096 [HUDI-7311] Add implicit literal type conversion before filter push down (#10531)
add a83f7c03836 [HUDI-7228] Fix eager closure of log reader input streams with log record reader (#10340)
No new revisions were added by this update.
Summary of changes:
.../utils/LegacyArchivedMetaEntryReader.java | 7 +++
.../hudi/common/table/log/HoodieLogFileReader.java | 52 +-
.../common/table/log/HoodieLogFormatReader.java | 32 +++--
.../table/log/block/HoodieAvroDataBlock.java | 5 ++-
.../common/table/log/block/HoodieCDCDataBlock.java | 5 ++-
.../common/table/log/block/HoodieCommandBlock.java | 5 ++-
.../common/table/log/block/HoodieCorruptBlock.java | 5 ++-
.../common/table/log/block/HoodieDataBlock.java | 5 ++-
.../common/table/log/block/HoodieDeleteBlock.java | 9 ++--
.../table/log/block/HoodieHFileDataBlock.java | 5 ++-
.../common/table/log/block/HoodieLogBlock.java | 11 ++---
.../table/log/block/HoodieParquetDataBlock.java | 5 ++-
12 files changed, 65 insertions(+), 81 deletions(-)
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
bvaradar merged PR #10340: URL: https://github.com/apache/hudi/pull/10340
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1909026446 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * b78aacdea8818d79256550e0ca2f2bb32708811e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22135) * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22149) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7327] remove meta cols from incoming schema in stream sync [hudi]
jonvex commented on code in PR #10556: URL: https://github.com/apache/hudi/pull/10556#discussion_r1465590724 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerTestBase.java: ## @@ -609,6 +610,7 @@ static HoodieDeltaStreamer.Config makeConfigForHudiIncrSrc(String srcBasePath, S cfg.schemaProviderClassName = schemaProviderClassName; } List cfgs = new ArrayList<>(); + cfgs.add(HANDLE_MISSING_COLUMNS_WITH_LOSSLESS_TYPE_PROMOTIONS.key() + "=true"); Review Comment: Yes. Without the change at line 664 of stream sync, it fails.
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
hudi-bot commented on PR #10549: URL: https://github.com/apache/hudi/pull/10549#issuecomment-1908955641 ## CI report: * ee8ed782107e9ef4aa7ebe50fa22fc68c6c14602 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22104) * ca627db36503a81c4223edde799bd344b9cf2b05 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22148) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6497] WIP HoodieStorage abstraction [hudi]
hudi-bot commented on PR #10360: URL: https://github.com/apache/hudi/pull/10360#issuecomment-1908955169 ## CI report: * 0a958d6408a7d0107ae2dcfc2aae676fd1a6977d UNKNOWN * 6632d6e715eec0e54ae047f1d89c8f979ac8639d UNKNOWN * b78aacdea8818d79256550e0ca2f2bb32708811e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22135) * 21bd59426ea0b1c4f3ecb9dd7fda124d9e3b3522 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
hudi-bot commented on PR #10549: URL: https://github.com/apache/hudi/pull/10549#issuecomment-1908945912 ## CI report: * ee8ed782107e9ef4aa7ebe50fa22fc68c6c14602 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22104) * ca627db36503a81c4223edde799bd344b9cf2b05 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
the-other-tim-brown commented on code in PR #10549: URL: https://github.com/apache/hudi/pull/10549#discussion_r1465518543 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -120,6 +121,7 @@ private void validateIdentifier(String id, Set identifiers, String confi private StructType getExpectedTransformedSchema(TransformerInfo transformerInfo, JavaSparkContext jsc, SparkSession sparkSession, Option incomingStructOpt, Option> rowDatasetOpt, TypedProperties properties) { +Option sourceSchemaOpt = sourceSchemaSupplier.get(); Review Comment: Added a test to validate that the supplier is called on each invocation of the method.
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340: URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908864979 ## CI report: * f401ab103abf2eb6e2827a98fa8627795642f064 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22147) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7323] Use a schema supplier instead of a static value [hudi]
yihua commented on code in PR #10549: URL: https://github.com/apache/hudi/pull/10549#discussion_r1465442627 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/transform/ChainedTransformer.java: ## @@ -120,6 +121,7 @@ private void validateIdentifier(String id, Set identifiers, String confi private StructType getExpectedTransformedSchema(TransformerInfo transformerInfo, JavaSparkContext jsc, SparkSession sparkSession, Option incomingStructOpt, Option> rowDatasetOpt, TypedProperties properties) { +Option sourceSchemaOpt = sourceSchemaSupplier.get(); Review Comment: Could you add a test for a scenario where the schema from the schema provider evolves and `getExpectedTransformedSchema` returns the updated `StructType` instance?
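The difference between a schema supplier and a static value that this review thread discusses can be sketched as follows. This is a minimal, self-contained illustration and not Hudi code: `SchemaSupplierSketch`, `Transformer`, `EvolvingProvider`, and the string-encoded schemas are all hypothetical stand-ins, and `java.util.Optional` is used in place of Hudi's `Option`.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class SchemaSupplierSketch {

  // Hypothetical stand-in for a transformer that needs the current source schema.
  static final class Transformer {
    private final Supplier<Optional<String>> sourceSchemaSupplier;

    Transformer(Supplier<Optional<String>> sourceSchemaSupplier) {
      this.sourceSchemaSupplier = sourceSchemaSupplier;
    }

    // Asks the supplier on every call, so a schema that evolved since the
    // transformer was constructed is picked up.
    String expectedSchema() {
      return sourceSchemaSupplier.get().orElse("<none>");
    }
  }

  // Hypothetical schema provider whose schema can change between batches.
  static final class EvolvingProvider {
    private String schema = "id:int";
    final AtomicInteger reads = new AtomicInteger();

    Optional<String> current() {
      reads.incrementAndGet(); // lets a test verify one read per invocation
      return Optional.of(schema);
    }

    void evolve(String newSchema) {
      schema = newSchema;
    }
  }

  public static void main(String[] args) {
    EvolvingProvider provider = new EvolvingProvider();
    Transformer transformer = new Transformer(provider::current);

    System.out.println(transformer.expectedSchema());
    provider.evolve("id:int,name:string");
    System.out.println(transformer.expectedSchema());
    System.out.println("supplier reads: " + provider.reads.get());
  }
}
```

Had the transformer captured the value once at construction time, the second call would still report the old schema; counting reads is one simple way for a test to check that the supplier is consulted per invocation.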
Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]
hudi-bot commented on PR #10500: URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908713547 ## CI report: * 93deb5002c4379e20f9aef4813d5ae3100513e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22146) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908713120 ## CI report: * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340: URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908688740 ## CI report: * 8d999c7e7946d2dc3d05e8bd7ebf53d5d5e8a57a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21558) * f401ab103abf2eb6e2827a98fa8627795642f064 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22147) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]
hudi-bot commented on PR #10500: URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908621790 ## CI report: * edf05d8127d2281fb7ad62747f494f0f3a2e9b2c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22128) * 93deb5002c4379e20f9aef4813d5ae3100513e11 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22146) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
hudi-bot commented on PR #10340: URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908621294 ## CI report: * 8d999c7e7946d2dc3d05e8bd7ebf53d5d5e8a57a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21558) * f401ab103abf2eb6e2827a98fa8627795642f064 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7298] Write bad records to error table in more cases instead of failing stream [hudi]
hudi-bot commented on PR #10500: URL: https://github.com/apache/hudi/pull/10500#issuecomment-1908608813 ## CI report: * edf05d8127d2281fb7ad62747f494f0f3a2e9b2c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22128) * 93deb5002c4379e20f9aef4813d5ae3100513e11 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7228] Fix eager closure of log reader input streams with log record reader [hudi]
bvaradar commented on PR #10340: URL: https://github.com/apache/hudi/pull/10340#issuecomment-1908605720 Fixed Conflicts and updated the diff
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
bvaradar commented on code in PR #10554: URL: https://github.com/apache/hudi/pull/10554#discussion_r1465256370 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala: ## @@ -2160,6 +2174,8 @@ class TestInsertTable extends HoodieSparkSqlTestBase { |union |select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 12 as `hour` |""".stripMargin) + val stageClassName = classOf[HoodieSparkEngineContext].getSimpleName + spark.sparkContext.addSparkListener(new StageParallelismListener(stageName = stageClassName)) Review Comment: @xuzifu666 Can you have `StageParallelismListener` update a shared (static) counter and assert here that the count increased by at least one, to ensure `StageParallelismListener` was indeed called as expected?
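The shared-counter idea suggested in this review comment can be sketched in plain Java. This is only an illustration under stated assumptions: `StageListener` is a hypothetical callback interface, not Spark's listener API, `CountingStageListener` is not the `StageParallelismListener` from the PR, and the stage events are simulated by hand.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ListenerCounterSketch {

  // Hypothetical stand-in for a stage-completion callback; this is not Spark's listener API.
  interface StageListener {
    void onStageCompleted(String stageName, int numTasks);
  }

  static final class CountingStageListener implements StageListener {
    // Static, so every instance shares it and a test can read it after the job runs.
    static final AtomicInteger MATCHED_STAGES = new AtomicInteger();

    private final String stageNameFragment;

    CountingStageListener(String stageNameFragment) {
      this.stageNameFragment = stageNameFragment;
    }

    @Override
    public void onStageCompleted(String stageName, int numTasks) {
      // Only count stages whose name matches, mirroring a name-filtered check.
      if (stageName.contains(stageNameFragment)) {
        MATCHED_STAGES.incrementAndGet();
      }
    }
  }

  public static void main(String[] args) {
    int before = CountingStageListener.MATCHED_STAGES.get();
    StageListener listener = new CountingStageListener("HoodieSparkEngineContext");

    // Simulated stage completions; in a real test the engine would fire these.
    listener.onStageCompleted("collect at HoodieSparkEngineContext.java", 1);
    listener.onStageCompleted("unrelated stage", 8);

    int delta = CountingStageListener.MATCHED_STAGES.get() - before;
    if (delta < 1) {
      throw new AssertionError("listener was never invoked for the expected stage");
    }
    System.out.println("matched stages: " + delta);
  }
}
```

Comparing against the value read before the job, rather than against zero, keeps the assertion valid when other tests in the same JVM have already bumped the static counter.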
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908511796 ## CI report: * d5c669fdb2b061ff6e65b42aa969be2902c033c7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21730) * f0d32bea4e960cd85b8e344597ec4f006c213b44 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22145) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7238] Bug fixes and optimization of ExternalSpillableMap [hudi]
hudi-bot commented on PR #10344: URL: https://github.com/apache/hudi/pull/10344#issuecomment-1908495665 ## CI report: * d5c669fdb2b061ff6e65b42aa969be2902c033c7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21730) * f0d32bea4e960cd85b8e344597ec4f006c213b44 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.
[ https://issues.apache.org/jira/browse/HUDI-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aditya Goenka closed HUDI-7330.
---
Fix Version/s: (was: 1.1.0)
Resolution: Duplicate
Duplicate of an already known bug: https://issues.apache.org/jira/browse/HUDI-6472. So cancelling this.
> With 0.14 upgrade, MIT failing with mismatched case in field names.
> ---
>
> Key: HUDI-7330
> URL: https://issues.apache.org/jira/browse/HUDI-7330
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: Aditya Goenka
> Priority: Critical
>
> With 0.14.0 upgrade, MIT is failing when the case of the fields do not match.
>
> Reproducible Code -
> create table merge_source (
>   id int, name string, price double
> ) using hudi
> tblproperties (primaryKey = 'id');
> insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);
> create table hudi_table (
>   id INT,
>   name STRING,
>   price DOUBLE
> ) USING hudi
> tblproperties (primaryKey = 'id');
> insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);
> merge into hudi_table as target
> using merge_source as source
> on target.id = source.id
> when matched then update set ID=source.ID, name=source.name
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6464) Implement Spark SQL Merge Into for tables without primary key
[ https://issues.apache.org/jira/browse/HUDI-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Vexler closed HUDI-6464.
-
Resolution: Fixed
> Implement Spark SQL Merge Into for tables without primary key
> -
>
> Key: HUDI-6464
> URL: https://issues.apache.org/jira/browse/HUDI-6464
> Project: Apache Hudi
> Issue Type: New Feature
> Components: spark-sql
> Reporter: Jonathan Vexler
> Assignee: Jonathan Vexler
> Priority: Major
> Labels: pull-request-available
>
> Merge Into currently only matches on the primary key which pkless tables don't have
[jira] [Commented] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.
[ https://issues.apache.org/jira/browse/HUDI-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810480#comment-17810480 ]
Aditya Goenka commented on HUDI-7330:
-
GitHub issue: [https://github.com/apache/hudi/issues/10558]
> With 0.14 upgrade, MIT failing with mismatched case in field names.
> ---
>
> Key: HUDI-7330
> URL: https://issues.apache.org/jira/browse/HUDI-7330
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 1.1.0
>
> With 0.14.0 upgrade, MIT is failing when the case of the fields do not match.
>
> Reproducible Code -
> create table merge_source (
>   id int, name string, price double
> ) using hudi
> tblproperties (primaryKey = 'id');
> insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);
> create table hudi_table (
>   id INT,
>   name STRING,
>   price DOUBLE
> ) USING hudi
> tblproperties (primaryKey = 'id');
> insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);
> merge into hudi_table as target
> using merge_source as source
> on target.id = source.id
> when matched then update set ID=source.ID, name=source.name
Re: [I] Hudi behaviour if AWS Glue concurrency is triggered[SUPPORT] [hudi]
ad1happy2go commented on issue #10559: URL: https://github.com/apache/hudi/issues/10559#issuecomment-1908435279 @rishabhreply Sorry, but I am a bit confused. Do you really want to use insert_overwrite in this case? If you submit two parallel jobs with insert_overwrite, one is going to overwrite the other's data in any case. Even if you run them sequentially, you will still lose the data ingested by the first one. So you can only use insert_overwrite if you process all 10 files in one batch. Let me know if I have misunderstood your use case.
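The point made in this reply can be illustrated with a deliberately simplified toy model. It assumes only that insert_overwrite replaces whatever was previously stored in the partition being written; it ignores Hudi's timeline, file groups, and concurrency control entirely, and `InsertOverwriteSketch` and its methods are hypothetical names, not Hudi APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InsertOverwriteSketch {

  // Toy table: partition value -> records currently visible in that partition.
  private final Map<String, List<String>> partitions = new HashMap<>();

  // Assumed semantics: the new batch replaces everything already in the partition.
  void insertOverwrite(String partition, List<String> records) {
    partitions.put(partition, List.copyOf(records));
  }

  List<String> read(String partition) {
    return partitions.getOrDefault(partition, List.of());
  }

  public static void main(String[] args) {
    // Two jobs, each given two files that belong to the same partition.
    InsertOverwriteSketch split = new InsertOverwriteSketch();
    split.insertOverwrite("p=1", List.of("file1", "file2"));
    split.insertOverwrite("p=1", List.of("file3", "file4"));
    System.out.println("two batches: " + split.read("p=1"));

    // One job that receives all files for the partition in a single batch.
    InsertOverwriteSketch single = new InsertOverwriteSketch();
    single.insertOverwrite("p=1", List.of("file1", "file2", "file3", "file4"));
    System.out.println("one batch:   " + single.read("p=1"));
  }
}
```

In this model the order of the two jobs does not matter: whichever write lands last determines the partition's contents, which is why the reply suggests processing all files in one batch when insert_overwrite is the intended operation.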
[jira] [Created] (HUDI-7330) With 0.14 upgrade, MIT failing with mismatched case in field names.
Aditya Goenka created HUDI-7330:
---
Summary: With 0.14 upgrade, MIT failing with mismatched case in field names.
Key: HUDI-7330
URL: https://issues.apache.org/jira/browse/HUDI-7330
Project: Apache Hudi
Issue Type: Bug
Components: spark-sql
Reporter: Aditya Goenka
Fix For: 1.1.0
With 0.14.0 upgrade, MIT is failing when the case of the fields do not match.
Reproducible Code -
create table merge_source (
  id int, name string, price double
) using hudi
tblproperties (primaryKey = 'id');
insert into merge_source values (1, "old_a1", 22.22), (2, "new_a2", 33.33), (3, "new_a3", 44.44);
create table hudi_table (
  id INT,
  name STRING,
  price DOUBLE
) USING hudi
tblproperties (primaryKey = 'id');
insert into hudi_table values (1, "oldid1", 100.00), (2, "oldid2", 200.00);
merge into hudi_table as target
using merge_source as source
on target.id = source.id
when matched then update set ID=source.ID, name=source.name
Re: [I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz commented on issue #10511: URL: https://github.com/apache/hudi/issues/10511#issuecomment-1908210748 >What do you think about, TBH, I have mixed feelings here. With 0.14 there is practically no way to understand how indexing or statistics affect queries, apart from the "output number of rows" in the Spark SQL dataframe; that is, whether they are used at all and, if they are, how effectively. This issue can be closed. From our end, we'll proceed on the assumption that indexing and statistics in Hudi are ineffective, though we will still enable them on our critical fields in case future releases of Hudi implement performance improvements.
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
hudi-bot commented on PR #10554: URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908112403 ## CI report: * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6230] Handle aws glue partition index [hudi]
parisni commented on code in PR #8743: URL: https://github.com/apache/hudi/pull/8743#discussion_r1464900587 ## hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java: ## @@ -432,6 +443,120 @@ public void createTable(String tableName, } } + /** + * This will manage partitions indexes. Users can activate/deactivate them on existing tables. + * Removing index definition, will result in dropping the index. + * + * reference doc for partition indexes: + * https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html#partition-index-getpartitions + * + * @param tableName + */ + public void managePartitionIndexes(String tableName) throws ExecutionException, InterruptedException { +if (!config.getBooleanOrDefault(META_SYNC_PARTITION_INDEX_FIELDS_ENABLE)) { + // deactivate indexing if enabled + if (getPartitionIndexEnable(tableName)) { +LOG.warn("Deactivating partition indexing"); Review Comment: Yes. The suggestion to use moto to mock AWS Glue is great; however, it does not support partition indexes right now. So moto should be considered as a basis for integration tests in hudi-aws, but not in this PR. BTW, I tested this and provided a Python script for people to try it out.
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
hudi-bot commented on PR #10554: URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908034214 ## CI report: * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN * 6af5f8ec3a1fb5459e4a0eb65f9ed152b4bbab2c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22117) * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22144) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
hudi-bot commented on PR #10554: URL: https://github.com/apache/hudi/pull/10554#issuecomment-1908022512 ## CI report: * e6934024c687f7deb7942e0edb833818aa96b843 UNKNOWN * 6af5f8ec3a1fb5459e4a0eb65f9ed152b4bbab2c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22117) * c3c58fa1feb8bf451e9d0d6cf7e074fe08010dbe UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Fix UT error in HUDI-6941 with stage task numbers [hudi]
xuzifu666 commented on code in PR #10554: URL: https://github.com/apache/hudi/pull/10554#discussion_r1464799184 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestInsertTable.scala: ## @@ -2160,6 +2172,7 @@ class TestInsertTable extends HoodieSparkSqlTestBase { |union |select '1' as id, 'aa' as name, 123 as dt, '2023-10-12' as `day`, 12 as `hour` |""".stripMargin) + spark.sparkContext.addSparkListener(new StageParallelismListener(stageName = "collect at HoodieSparkEngineContext.java")) Review Comment: Hi, I tried depending on the class. In this case, all query stages relate to the HoodieSparkEngineContext class, so I changed the check to use this class. @bvaradar PTAL
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
CamelliaYjli commented on issue #10486: URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907801206

> Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb

OK, thx ~
[I] Hudi behaviour if AWS Glue concurrency is triggered[SUPPORT] [hudi]
rishabhreply opened a new issue, #10559: URL: https://github.com/apache/hudi/issues/10559

**Describe the problem you faced**

It is not a problem but rather a question that I could not find answered in the FAQs. Please let me know if it is not acceptable to ask here.

I have data coming in multiple files (let's say 10 files) for one table, and all of them will have the same value in partition_column. My setup is a state machine with Glue parallelization enabled. Let's say I have set batch size=2 and concurrency=5 in the state machine; this means the state machine will trigger 5 parallel Glue job instances and give each instance 2 files to process. I am using the Hudi **insert_overwrite** operation.

Q1. In this setting, how will Hudi behave, given that not all Glue job instances might finish at the same time? Will I see any Hudi errors? Or will it "overwrite" the data written by the Glue job instances that finished earlier?

**Environment Description**

* Hudi version :
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 closed issue #10486: [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not URL: https://github.com/apache/hudi/issues/10486
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
danny0405 commented on issue #10486: URL: https://github.com/apache/hudi/issues/10486#issuecomment-1907724040

Yeah, you should use `HoodieHiveInputFormat` or `HoodieCombineHiveInputFormat`. This is a Chinese doc that you can use as a reference: https://www.yuque.com/yuzhao-my9fz/kb/kgv2rb
(hudi) branch asf-site updated: [Docs] updated button size so join now is on one line (#10557)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 539402c0638 [Docs] updated button size so join now is on one line (#10557)
539402c0638 is described below

commit 539402c06387111b3e3ce8c243120e419de27d8e
Author: nadine farah
AuthorDate: Wed Jan 24 01:16:51 2024 -0800

    [Docs] updated button size so join now is on one line (#10557)
---
 website/src/components/EventFeature/styles.module.css | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/src/components/EventFeature/styles.module.css b/website/src/components/EventFeature/styles.module.css
index 416f746d6e5..ff6deb0db25 100644
--- a/website/src/components/EventFeature/styles.module.css
+++ b/website/src/components/EventFeature/styles.module.css
@@ -28,5 +28,5 @@
     font-weight: bold;
     display: inline-block;
     text-align: center;
-    min-width: 230px
+    min-width: 280px
 }
\ No newline at end of file
Re: [PR] updated button size so join now is on one line [hudi]
danny0405 merged PR #10557: URL: https://github.com/apache/hudi/pull/10557
Re: [I] [SUPPORT] is it possible to read/write hudi files with another programming language? [hudi]
schlichtanders commented on issue #7446: URL: https://github.com/apache/hudi/issues/7446#issuecomment-1907707785

Thank you @cheunhong. I agree, and it is a pity. Hudi's support for streaming is super attractive for me. Neither delta-rs nor iceberg has it, as far as I know...
[jira] [Updated] (HUDI-7311) Comparing date with date literal in string format causes class cast exception during filter push down
[ https://issues.apache.org/jira/browse/HUDI-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-7311:
    Fix Version/s: 1.0.0

> Comparing date with date literal in string format causes class cast exception during filter push down
>
>                 Key: HUDI-7311
>                 URL: https://issues.apache.org/jira/browse/HUDI-7311
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>    Affects Versions: 0.14.0, 0.14.1
>            Reporter: Yao Zhang
>            Assignee: Yao Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Given any table with a field of type date (e.g. field d_date), execute a SQL query with a condition on this field in the WHERE clause:
> {code:sql}
> select d_date from xxx where d_date = '2020-01-01'
> {code}
> An exception will occur:
> {code:java}
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
>     at org.apache.hudi.source.ExpressionPredicates.toParquetPredicate(ExpressionPredicates.java:613)
>     at org.apache.hudi.source.ExpressionPredicates.access$100(ExpressionPredicates.java:64)
>     at org.apache.hudi.source.ExpressionPredicates$ColumnPredicate.filter(ExpressionPredicates.java:226)
>     at org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:68)
>     at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130)
>     at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
>     at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
>     at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}
> Hudi Flink cannot convert the date literal in String format to Integer (the primitive type of date). However, the same SQL works in Flink without Hudi.
> In summary, we should add automatic literal type conversion before filter push down.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
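The failure and the proposed conversion can be illustrated outside Hudi with a minimal sketch. It assumes the literal arrives as ISO-8601 text and that the filter expects the date as an Integer count of days since 1970-01-01 (the "primitive type of date" mentioned above). `DateLiteralSketch` and `toEpochDay` are hypothetical names chosen for illustration; this is not the code from the fix.

```java
import java.io.Serializable;
import java.time.LocalDate;

/** Hypothetical helper, not a Hudi class: shows why the cast fails and one way to convert. */
final class DateLiteralSketch {

  /** Returns the literal as days since 1970-01-01. */
  static Integer toEpochDay(Serializable literal) {
    if (literal instanceof Integer) {
      return (Integer) literal; // already in the integer representation
    }
    // Assumption: the text is ISO-8601 such as "2020-01-01";
    // any other format makes LocalDate.parse throw DateTimeParseException.
    return Math.toIntExact(LocalDate.parse(String.valueOf(literal)).toEpochDay());
  }

  public static void main(String[] args) {
    Serializable literal = "2020-01-01";
    try {
      // Casting the String literal directly reproduces the reported exception type.
      Integer broken = (Integer) literal;
      System.out.println(broken);
    } catch (ClassCastException e) {
      System.out.println("cast failed: " + e.getMessage());
    }
    System.out.println(toEpochDay(literal)); // prints 18262
  }
}
```

Converting before the predicate is built means the comparison value already has the type the column is stored as, so no cast is attempted on the raw string.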
(hudi) branch master updated: [HUDI-7311] Add implicit literal type conversion before filter push down (#10531)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 77833cdb096 [HUDI-7311] Add implicit literal type conversion before filter push down (#10531)
77833cdb096 is described below

commit 77833cdb09661b2cdac740520b51a29264afd9c7
Author: Paul Zhang
AuthorDate: Wed Jan 24 17:15:07 2024 +0800

    [HUDI-7311] Add implicit literal type conversion before filter push down (#10531)
---
 .../apache/hudi/source/ExpressionPredicates.java   |   4 +-
 .../apache/hudi/util/ImplicitTypeConverter.java    | 134 +
 .../hudi/source/TestExpressionPredicates.java      |  61 ++
 3 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
index 8faf705a81f..58ee59a8176 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
@@ -26,6 +26,7 @@ import org.apache.flink.table.expressions.ValueLiteralExpression;
 import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
 import org.apache.flink.table.functions.FunctionDefinition;
 import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.hudi.util.ImplicitTypeConverter;
 import org.apache.parquet.filter2.predicate.FilterPredicate;
 import org.apache.parquet.filter2.predicate.Operators;
 import org.slf4j.Logger;
@@ -223,7 +224,8 @@ public class ExpressionPredicates {
     @Override
     public FilterPredicate filter() {
-      return toParquetPredicate(getFunctionDefinition(), literalType, columnName, literal);
+      Serializable convertedLiteral = ImplicitTypeConverter.convertImplicitly(literalType, literal);
+      return toParquetPredicate(getFunctionDefinition(), literalType, columnName, convertedLiteral);
     }
     /**
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java
new file mode 100644
index 000..601b878655f
--- /dev/null
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ImplicitTypeConverter.java
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.flink.table.types.logical.LogicalType;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.time.LocalDate;
+import java.time.LocalDateTime;
+import java.time.LocalTime;
+import java.time.ZoneOffset;
+import java.time.temporal.ChronoField;
+
+/**
+ * Implicit type converter for predicates push down.
+ */
+public class ImplicitTypeConverter {
+
+  private static final Logger LOG = LoggerFactory.getLogger(ImplicitTypeConverter.class);
+
+  /**
+   * Convert the literal to the corresponding type.
+   * @param literalType The type of the literal.
+   * @param literal The literal value.
+   * @return The converted literal.
+   */
+  public static Serializable convertImplicitly(LogicalType literalType, Serializable literal) {
+    try {
+      switch (literalType.getTypeRoot()) {
+        case BOOLEAN:
+          if (literal instanceof Boolean) {
+            return literal;
+          } else {
+            return Boolean.valueOf(String.valueOf(literal));
+          }
+        case TINYINT:
+        case SMALLINT:
+        case INTEGER:
+          if (literal instanceof Integer) {
+            return literal;
+          } else {
+            return Integer.valueOf(String.valueOf(literal));
+          }
+        case BIGINT:
+          if (literal instanceof Long) {
+            return literal;
+          } else if (literal instanceof Integer) {
+            return new Long((Integer) literal);
+          } else {
+            return Long.valueOf(String.valueOf(lite
[jira] [Closed] (HUDI-7311) Comparing date with date literal in string format causes class cast exception during filter push down
[ https://issues.apache.org/jira/browse/HUDI-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-7311.
    Resolution: Fixed

Fixed via master branch: 77833cdb09661b2cdac740520b51a29264afd9c7
Re: [PR] [HUDI-7311] Add implicit literal type conversion before filter push down [hudi]
danny0405 merged PR #10531: URL: https://github.com/apache/hudi/pull/10531
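The pattern in the merged change, as far as the truncated diff shows it, is to return the literal unchanged when it already has the Java type expected for the column and otherwise to parse its string form. Below is a minimal sketch of that pattern without the Flink dependency. `ImplicitConversionSketch` and `TypeRoot` are hypothetical names (the enum stands in for the value returned by `literalType.getTypeRoot()`), and falling back to the original literal on a parse failure is an assumption, since the diff is cut off before any `catch` block.

```java
import java.io.Serializable;

/** Hypothetical stand-in for the converter; it does not use Flink's LogicalType. */
final class ImplicitConversionSketch {

  /** Hypothetical subset of type roots, enough to show the pattern. */
  enum TypeRoot { BOOLEAN, INTEGER, BIGINT }

  static Serializable convert(TypeRoot type, Serializable literal) {
    try {
      switch (type) {
        case BOOLEAN:
          if (literal instanceof Boolean) {
            return literal;
          }
          return Boolean.valueOf(String.valueOf(literal));
        case INTEGER:
          if (literal instanceof Integer) {
            return literal;
          }
          return Integer.valueOf(String.valueOf(literal));
        case BIGINT:
          if (literal instanceof Long) {
            return literal;
          }
          if (literal instanceof Integer) {
            // Widen an Integer literal instead of going through its string form.
            return Long.valueOf(((Integer) literal).longValue());
          }
          return Long.valueOf(String.valueOf(literal));
        default:
          return literal;
      }
    } catch (RuntimeException e) {
      // Assumption: keep the original literal when it cannot be parsed.
      return literal;
    }
  }

  public static void main(String[] args) {
    System.out.println(convert(TypeRoot.BIGINT, 5).getClass().getSimpleName());
    System.out.println(convert(TypeRoot.INTEGER, "42"));
    System.out.println(convert(TypeRoot.INTEGER, "abc"));
  }
}
```

Checking `instanceof` first keeps already-typed literals untouched, so the conversion only does work for the mismatched cases such as a quoted number or date in a SQL predicate.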