Re: [PR] [HUDI-7303] Fix date field type unexpectedly convert to Long when usi… [hudi]
hudi-bot commented on PR #10517: URL: https://github.com/apache/hudi/pull/10517#issuecomment-1895276998 ## CI report: * 513d914fa72c497458c834d0b33962996b3d3e03 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]
hudi-bot commented on PR #10516: URL: https://github.com/apache/hudi/pull/10516#issuecomment-1895276911 ## CI report: * 8e44409db8f731627be1dbb55b7594bd94500e2f UNKNOWN
Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]
xicm commented on issue #10486: URL: https://github.com/apache/hudi/issues/10486#issuecomment-1895237009 Seems a bug
[jira] [Updated] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages
[ https://issues.apache.org/jira/browse/HUDI-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xy updated HUDI-7304: - Attachment: spark_metrics_messages.jpg > Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid > large mertric messages > - > > Key: HUDI-7304 > URL: https://issues.apache.org/jira/browse/HUDI-7304 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > Attachments: spark_metrics_messages.jpg > > > Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid > large mertric messages -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7303) Date field type unexpectedly convert to Long when using date comparison operator
[ https://issues.apache.org/jira/browse/HUDI-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7303:
---------------------------------
    Labels: pull-request-available  (was: )

> Date field type unexpectedly convert to Long when using date comparison operator
> ---------------------------------
>
>                 Key: HUDI-7303
>                 URL: https://issues.apache.org/jira/browse/HUDI-7303
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>    Affects Versions: 0.14.0, 0.14.1
>         Environment: Flink 1.15.4 Hudi 0.14.0
>                      Flink 1.17.1 Hudi 0.14.0
>                      Flink 1.17.1 Hudi 0.14.1rc1
>            Reporter: Yao Zhang
>            Assignee: Yao Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> Given the table date_dim from TPCDS as an example:
> {code:java}
> CREATE TABLE date_dim (
>   d_date_sk int,
>   d_date_id varchar(16) NOT NULL,
>   d_date date,
>   d_month_seq int,
>   d_week_seq int,
>   d_quarter_seq int,
>   d_year int,
>   d_dow int,
>   d_moy int,
>   d_dom int,
>   d_qoy int,
>   d_fy_year int,
>   d_fy_quarter_seq int,
>   d_fy_week_seq int,
>   d_day_name varchar(9),
>   d_quarter_name varchar(6),
>   d_holiday char(1),
>   d_weekend char(1),
>   d_following_holiday char(1),
>   d_first_dom int,
>   d_last_dom int,
>   d_same_day_ly int,
>   d_same_day_lq int,
>   d_current_day char(1),
>   d_current_week char(1),
>   d_current_month char(1),
>   d_current_quarter char(1),
>   d_current_year char(1)) with (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///table_path/date_dim',
>   'table.type' = 'COPY_ON_WRITE'); {code}
> When you execute the following select statement, an exception will be thrown:
> {code:java}
> select * from date_dim where d_date between cast('1999-02-22' as date) and (cast('1999-02-22' as date) + INTERVAL '30' day);
> {code}
> The exception is:
> {code:java}
> java.lang.IllegalArgumentException: FilterPredicate column: d_date's declared type (java.lang.Long) does not match the schema found in file metadata.
> Column d_date is of type: INT32
> Valid types for this column are: [class java.lang.Integer]
>     at org.apache.parquet.filter2.predicate.ValidTypeMap.assertTypeValid(ValidTypeMap.java:125) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:179) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:113) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.Operators$GtEq.accept(Operators.java:246) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:119) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:306) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:67) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:142) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:153) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>     at org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:78) ~[hudi-flink1.17-bund
[PR] [HUDI-7303] Fix date field type unexpectedly convert to Long when usi… [hudi]
paul8263 opened a new pull request, #10517: URL: https://github.com/apache/hudi/pull/10517

…ng date comparison operator.

### Change Logs
When the between, less-than (or less-than-or-equal), or greater-than (or greater-than-or-equal) operators are used with a field of type date, the date type is unexpectedly converted to Long, which is incompatible with the column's primitive type, INT32.

### Impact
No impact.

### Risk level (write none, low medium or high below)
Low risk level.

### Documentation Update
No need to update the documentation.

### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
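To make the type mismatch above concrete: Parquet's DATE logical type annotates an INT32 that holds the number of days since 1970-01-01, so a date literal in a pushed-down filter has to be an int-sized day count rather than a Long. The sketch below only illustrates that representation for the two bounds of the query in HUDI-7303; the helper name `date_to_epoch_days` is made up for this example and is not part of Hudi, Flink, or Parquet.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def date_to_epoch_days(d: date) -> int:
    """Return days since 1970-01-01, the value a Parquet DATE column stores as INT32.

    Hypothetical helper for illustration; not Hudi's conversion code.
    """
    days = (d - EPOCH).days
    if not INT32_MIN <= days <= INT32_MAX:
        raise OverflowError(f"{d} does not fit into INT32 epoch days")
    return days


# Bounds of the failing query: d_date between DATE '1999-02-22' and that date + 30 days.
lower = date(1999, 2, 22)
upper = lower + timedelta(days=30)

lower_days = date_to_epoch_days(lower)
upper_days = date_to_epoch_days(upper)
print(lower_days, upper_days)
```

Both bounds fit comfortably into 32 bits, which is why a predicate declared over `java.lang.Long` is rejected for a column whose only valid type is `java.lang.Integer`.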
[jira] [Updated] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages
[ https://issues.apache.org/jira/browse/HUDI-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7304: - Labels: pull-request-available (was: ) > Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid > large mertric messages > - > > Key: HUDI-7304 > URL: https://issues.apache.org/jira/browse/HUDI-7304 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: xy >Assignee: xy >Priority: Major > Labels: pull-request-available > > Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid > large mertric messages
Re: [PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]
xuzifu666 commented on PR #10516: URL: https://github.com/apache/hudi/pull/10516#issuecomment-1895199759 cc @danny0405
[PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]
xuzifu666 opened a new pull request, #10516: URL: https://github.com/apache/hudi/pull/10516

### Change Logs
`DataSourceInternalWriterHelper::onDataWriterCommit` prints a large amount of commit detail that users do not need and that clutters their logs, so this change moves the message to the debug log level.

### Impact
none

### Risk level (write none, low medium or high below)
none

### Documentation Update
_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Created] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages
xy created HUDI-7304: Summary: Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages Key: HUDI-7304 URL: https://issues.apache.org/jira/browse/HUDI-7304 Project: Apache Hudi Issue Type: Improvement Components: spark Reporter: xy Assignee: xy Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages
Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]
ad1happy2go commented on issue #9918: URL: https://github.com/apache/hudi/issues/9918#issuecomment-1895063366 @victorxiang30 @Armelabdelkbir @watermelon12138 Can you provide the schema so that I can reproduce this? If it contains a complex data type, can you try setting the Spark config `spark.hadoop.parquet.avro.write-old-list-structure` to false?
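One way to try the suggested setting is to pass it as a `--conf key=value` flag to `spark-submit`. The small sketch below only renders such flags from a dictionary; the helper `to_spark_submit_flags` is a name invented for this example, and only the config key and value come from the comment above.

```python
import shlex

# The setting suggested in the comment above.
SPARK_CONF = {
    "spark.hadoop.parquet.avro.write-old-list-structure": "false",
}


def to_spark_submit_flags(conf: dict) -> str:
    """Render a conf mapping as `--conf key=value` flags (hypothetical helper)."""
    return " ".join(
        f"--conf {shlex.quote(f'{key}={value}')}"
        for key, value in sorted(conf.items())
    )


flags = to_spark_submit_flags(SPARK_CONF)
print(flags)
```

In PySpark the same key and value can instead be handed to `SparkSession.builder.config(key, value)` before the session is created.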
Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]
danny0405 commented on issue #10503: URL: https://github.com/apache/hudi/issues/10503#issuecomment-1895010618 What is the requested file `6548b5aa910845504c7cdea4_1705406501315.795.csv`? It does not belong to the Hoodie data format.
Re: [I] [SUPPORT] CoW: Hudi Upsert not working when there is a timestamp field in the composite key [hudi]
ad1happy2go commented on issue #10303: URL: https://github.com/apache/hudi/issues/10303#issuecomment-1895009159

@srinikandi Sorry for the delay on this. I was able to reproduce the issue with Hudi versions 0.12.1 and 0.14.1. We have introduced the config "hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled"; you can set it to true.

```
public static final ConfigProperty<String> KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED = ConfigProperty
    .key("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
    .defaultValue("false")
    .withDocumentation("When set to true, consistent value will be generated for a logical timestamp type column, "
        + "like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so "
        + "as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, "
        + "if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp "
        + "`2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `148302324000` in non row-writer path. "
        + "If enabled, then the timestamp value will be written in both the cases.");
```

Reproducible code, which works once the config is set:

```
from faker import Faker
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Not in the original comment: `spark` and `PATH` were used without being defined.
# Assumes the Hudi Spark bundle is on the classpath; PATH is a placeholder location.
spark = SparkSession.builder.getOrCreate()
PATH = "file:///tmp/hudi_issue_10303"

#.. Fake Data Generation ...
fake = Faker()
data = [{"transactionId": fake.uuid4(), "EventTime": "2014-01-01 23:00:01", "storeNbr": "1",
         "FullName": fake.name(), "Address": fake.address(),
         "CompanyName": fake.company(), "JobTitle": fake.job(),
         "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
         "RandomText": fake.sentence(), "City": fake.city(),
         "State": "NYC", "Country": "US"} for _ in range(5)]
pandas_df = pd.DataFrame(data)

hoodi_configs = {
    "hoodie.insert.shuffle.parallelism": "1",
    "hoodie.upsert.shuffle.parallelism": "1",
    "hoodie.bulkinsert.shuffle.parallelism": "1",
    "hoodie.delete.shuffle.parallelism": "1",
    "hoodie.datasource.write.row.writer.enable": "true",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "transactionId,storeNbr,EventTime",
    "hoodie.datasource.write.precombine.field": "Country",
    "hoodie.datasource.write.partitionpath.field": "State",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.combine.before.upsert": "true",
    "hoodie.table.name": "huditransaction",
    # Posted as "false"; per the comment above, set this to "true" to get consistent record keys.
    "hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled": "false",
}

spark.sparkContext.setLogLevel("WARN")

# Cast the string column to a timestamp so that it becomes part of the composite key as a logical timestamp.
df = spark.createDataFrame(pandas_df).withColumn("EventTime", F.expr("cast(EventTime as timestamp)"))

# Initial load through bulk_insert, then print the generated record keys.
df.write.format("hudi").options(**hoodi_configs).option("hoodie.datasource.write.operation", "bulk_insert").mode("overwrite").save(PATH)
spark.read.options(**hoodi_configs).format("hudi").load(PATH).select("_hoodie_record_key").show(10, False)

# Upsert the same rows with a changed City, then print the record keys again.
df.withColumn("City", F.lit("updated_city")).write.format("hudi").options(**hoodi_configs).option("hoodie.datasource.write.operation", "upsert").mode("append").save(PATH)
spark.read.options(**hoodi_configs).format("hudi").load(PATH).select("_hoodie_record_key").show(10, False)
```

Let me know in case you need any more help on this. Thanks.
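The quoted documentation says that, with the flag disabled, the same timestamp is written as a formatted string on the row-writer path and as a long on the non-row-writer path. The sketch below is a plain-Python illustration of why two such encodings never compare equal as record keys; it is not Hudi's key generator. The helper names, the UTC time zone, and the millisecond precision are assumptions, so the long it produces is not expected to match the value quoted in the documentation string.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_timestamp_string(ts: datetime) -> str:
    """Format like the row-writer example in the quoted docs: 'YYYY-MM-DD HH:MM:SS.f'."""
    return ts.strftime("%Y-%m-%d %H:%M:%S") + f".{ts.microsecond // 100000}"


def as_epoch_millis(ts: datetime) -> int:
    """Encode the same instant as a long (milliseconds since the epoch; assumed precision)."""
    return (ts - EPOCH) // timedelta(milliseconds=1)


# Assumption: the timestamp from the documentation example, interpreted as UTC.
event_time = datetime(2016, 12, 29, 9, 54, 0, tzinfo=timezone.utc)

key_row_writer = as_timestamp_string(event_time)
key_non_row_writer = str(as_epoch_millis(event_time))

# Same logical timestamp, two different key strings: an upsert keyed on one
# encoding cannot find a record that was written with the other.
print(key_row_writer, key_non_row_writer, key_row_writer == key_non_row_writer)
```

This is the situation in the reproduction above, where the initial `bulk_insert` and the later `upsert` may take different write paths unless the consistent-timestamp flag is enabled.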
(hudi) branch master updated: [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping (#10493)
This is an automated email from the ASF dual-hosted git repository. stream2000 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new eae5d4ae8e6 [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping (#10493) eae5d4ae8e6 is described below commit eae5d4ae8e62014191fac76bbbeae0939f11100b Author: majian <47964462+majian1...@users.noreply.github.com> AuthorDate: Wed Jan 17 14:17:29 2024 +0800 [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping (#10493) * push down partition pruning filters when loading col stats index --- .../org/apache/hudi/ColumnStatsIndexSupport.scala | 14 ++-- .../scala/org/apache/hudi/HoodieFileIndex.scala| 37 ++ 2 files changed, 36 insertions(+), 15 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala index 9cdb15092b0..7a75c6c35ca 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala @@ -26,6 +26,7 @@ import org.apache.hudi.avro.model._ import org.apache.hudi.client.common.HoodieSparkEngineContext import org.apache.hudi.common.config.HoodieMetadataConfig import org.apache.hudi.common.data.HoodieData +import org.apache.hudi.common.function.SerializableFunction import org.apache.hudi.common.model.HoodieRecord import org.apache.hudi.common.table.HoodieTableMetaClient import org.apache.hudi.common.util.BinaryUtil.toBytes @@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession, * * Please check out scala-doc of the [[transpose]] method explaining this view in more details */ - def 
loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = { + def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = { cachedColumnStatsIndexViews.get(targetColumns) match { case Some(cachedDF) => block(cachedDF) case None => -val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = +val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedFileNames.isEmpty) { + // NOTE: Because some tests directly check this method and don't get prunedPartitionsAndFileSlices, we need to make sure these tests are correct. loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory) +} else { + val filterFunction = new SerializableFunction[HoodieMetadataColumnStats, java.lang.Boolean] { +override def apply(r: HoodieMetadataColumnStats): java.lang.Boolean = { + prunedFileNames.contains(r.getFileName) +} + } + loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory).filter(filterFunction) +} withPersistedData(colStatsRecords, StorageLevel.MEMORY_ONLY) { val (transposedRows, indexSchema) = transpose(colStatsRecords, targetColumns) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala index 709dfec183b..db8525be3d1 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala @@ -234,7 +234,7 @@ case class HoodieFileIndex(spark: SparkSession, //- Record-level Index is present //- List of predicates (filters) is present val candidateFilesNamesOpt: Option[Set[String]] = - lookupCandidateFilesInMetadataTable(dataFilters) match { + lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionsAndFileSlices) match { case Success(opt) => opt 
case Failure(e) => logError("Failed to lookup candidate files in File Index", e) @@ -316,11 +316,6 @@ case class HoodieFileIndex(spark: SparkSession, }) } - private def lookupFileNamesMissingFromIndex(allIndexedFileNames: Set[String]) = { -val allFileNames = getAllFiles().map(f => f.getPath.getName).toSet -allFileNames -- allIndexedFileNames - } - /** * Computes pruned list of candidate base-files' names based on provided list of {@link dataFilters} * conditions, by leveraging Metadata Table's Record Level Index and Column Statistics index (hereon referred as @@ -333,7 +328,7 @@ case class HoodieFileIndex(spark: SparkSession, * @param que
Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]
stream2000 merged PR #10493: URL: https://github.com/apache/hudi/pull/10493
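The core of the merged change is that `loadTransposed` now accepts a set of pruned file names and, when that set is non-empty, keeps only the column-stats records whose file name is in it; an empty set keeps the previous behavior of loading everything. The following is a simplified Python sketch of that filtering step only, not the Scala implementation from the commit. `ColumnStat` and `filter_column_stats` are illustrative names.

```python
from typing import AbstractSet, Iterable, List, NamedTuple


class ColumnStat(NamedTuple):
    """Simplified stand-in for a column-stats index record (illustrative only)."""
    file_name: str
    column_name: str
    min_value: int
    max_value: int


def filter_column_stats(
    records: Iterable[ColumnStat],
    pruned_file_names: AbstractSet[str] = frozenset(),
) -> List[ColumnStat]:
    """Keep only stats for files that survived partition pruning.

    An empty set means "no pruning information", so every record is kept,
    mirroring the `prunedFileNames.isEmpty` branch in the diff.
    """
    if not pruned_file_names:
        return list(records)
    return [r for r in records if r.file_name in pruned_file_names]


stats = [
    ColumnStat("f1.parquet", "price", 1, 10),
    ColumnStat("f2.parquet", "price", 11, 20),
    ColumnStat("f3.parquet", "price", 21, 30),
]

kept = filter_column_stats(stats, {"f1.parquet", "f3.parquet"})
print([s.file_name for s in kept])
```

Filtering before the transpose step means the later data-skipping work only handles rows for files that can still be candidates.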
Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]
hudi-bot commented on PR #10515: URL: https://github.com/apache/hudi/pull/10515#issuecomment-1895004422 ## CI report: * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994) * 6d7150a24ab2169d780e5a98193144f5a16ad230 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21996)
Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]
hudi-bot commented on PR #10515: URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894997004 ## CI report: * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994) * 6d7150a24ab2169d780e5a98193144f5a16ad230 UNKNOWN
Re: [PR] [DOCS] Add parquet merge schema config [hudi]
yihua commented on code in PR #10463: URL: https://github.com/apache/hudi/pull/10463#discussion_r1454666401 ## website/docs/configurations.md: ## @@ -1792,6 +1792,16 @@ Configurations controlling the behavior of Kafka source in Hudi Streamer. | [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records.`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS``Since Version: 0.9.0` | --- + Parquet DFS Source Configs {#Parquet-DFS-Source-Configs} Review Comment: Config page is automatically generated. Just to double check, did you use the [tool](https://github.com/apache/hudi/tree/asf-site/hudi-utils) to generate these changes?
[jira] [Created] (HUDI-7303) Date field type unexpectedly convert to Long when using date comparison operator
Yao Zhang created HUDI-7303:
-------------------------------

             Summary: Date field type unexpectedly convert to Long when using date comparison operator
                 Key: HUDI-7303
                 URL: https://issues.apache.org/jira/browse/HUDI-7303
             Project: Apache Hudi
          Issue Type: Bug
          Components: flink
    Affects Versions: 0.14.1, 0.14.0
         Environment: Flink 1.15.4 Hudi 0.14.0
                      Flink 1.17.1 Hudi 0.14.0
                      Flink 1.17.1 Hudi 0.14.1rc1
            Reporter: Yao Zhang
            Assignee: Yao Zhang

Given the table date_dim from TPCDS as an example:
{code:java}
CREATE TABLE date_dim (
  d_date_sk int,
  d_date_id varchar(16) NOT NULL,
  d_date date,
  d_month_seq int,
  d_week_seq int,
  d_quarter_seq int,
  d_year int,
  d_dow int,
  d_moy int,
  d_dom int,
  d_qoy int,
  d_fy_year int,
  d_fy_quarter_seq int,
  d_fy_week_seq int,
  d_day_name varchar(9),
  d_quarter_name varchar(6),
  d_holiday char(1),
  d_weekend char(1),
  d_following_holiday char(1),
  d_first_dom int,
  d_last_dom int,
  d_same_day_ly int,
  d_same_day_lq int,
  d_current_day char(1),
  d_current_week char(1),
  d_current_month char(1),
  d_current_quarter char(1),
  d_current_year char(1)) with (
  'connector' = 'hudi',
  'path' = 'hdfs:///table_path/date_dim',
  'table.type' = 'COPY_ON_WRITE'); {code}
When you execute the following select statement, an exception will be thrown:
{code:java}
select * from date_dim where d_date between cast('1999-02-22' as date) and (cast('1999-02-22' as date) + INTERVAL '30' day);
{code}
The exception is:
{code:java}
java.lang.IllegalArgumentException: FilterPredicate column: d_date's declared type (java.lang.Long) does not match the schema found in file metadata.
Column d_date is of type: INT32
Valid types for this column are: [class java.lang.Integer]
    at org.apache.parquet.filter2.predicate.ValidTypeMap.assertTypeValid(ValidTypeMap.java:125) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:179) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:113) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.Operators$GtEq.accept(Operators.java:246) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:119) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:306) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:67) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:142) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:153) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:78) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66) ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
    at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84) ~[flink-dist-1.17.1.jar:1.17.1]
    a
(hudi) branch master updated (108a885b4db -> d899fba9c71)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 108a885b4db [HUDI-7294] TVF to query hudi metadata (#10491) add d899fba9c71 Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." (#10514) No new revisions were added by this update. Summary of changes: .../common/table/timeline/HoodieInstantTimeGenerator.java | 4 .../common/table/timeline/TestHoodieActiveTimeline.java | 13 - 2 files changed, 17 deletions(-)
Re: [PR] [MINOR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]
yihua merged PR #10514: URL: https://github.com/apache/hudi/pull/10514
Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]
hudi-bot commented on PR #9640: URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894988743 ## CI report: * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN * d4c05ddde2295cf97a5b40edc3a7d62deca5a326 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21993)
Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]
gsudhanshu commented on issue #10503: URL: https://github.com/apache/hudi/issues/10503#issuecomment-1894985425 Yes, I am using PySpark 3.4.2. Complete error log: ``` An error occurred while calling o208.load. : java.io.FileNotFoundException: File /var/www/maustats/primaryData/CD/6548b5aa910845504c7cdea4_1705406501315.795.csv does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462) at org.apache.hudi.common.util.TablePathUtils.getTablePath(TablePathUtils.java:58) at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:79) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:111) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229) at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:829) ```
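The trace above fails inside `TablePathUtils.getTablePath`, where `getFileStatus` is called on a local path that does not exist. A minimal Python sketch of a pre-flight check that could be run before calling `load` follows. It assumes a local filesystem; the helper names `looks_like_hudi_table` and `describe_path` are hypothetical and are not part of Hudi or PySpark. The only Hudi fact relied on is that a table's base path contains a `.hoodie` directory.

```python
import os


def looks_like_hudi_table(base_path: str) -> bool:
    """Return True if base_path is a local directory that contains a .hoodie folder."""
    return os.path.isdir(base_path) and os.path.isdir(os.path.join(base_path, ".hoodie"))


def describe_path(base_path: str) -> str:
    """Classify a local path before handing it to a Hudi reader (hypothetical helper)."""
    if not os.path.exists(base_path):
        return "missing"                    # the situation reported in the trace
    if os.path.isfile(base_path):
        return "file"                       # a plain file, e.g. a CSV, not a table base path
    if looks_like_hudi_table(base_path):
        return "hudi-table"
    return "directory-without-.hoodie"
```

Running `describe_path` on the path from the error message, on the same machine that runs the job, would separate a missing file from a path that exists but is not a Hudi table.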
Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]
maheshguptags commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894959705 @danny0405 Can you please share the config to reduce the file groups per commit?
Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]
hudi-bot commented on PR #10515: URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894955959 ## CI report: * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994)
Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]
hudi-bot commented on PR #9640: URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894955106 ## CI report: * 0c7300dbe529e40a4ce261032787843e241f2b45 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21966) * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN * d4c05ddde2295cf97a5b40edc3a7d62deca5a326 UNKNOWN
Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]
hudi-bot commented on PR #10515: URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894950233 ## CI report: * b4df6b857e79dfb636e3af695d305e8ea50077cc UNKNOWN
Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]
hudi-bot commented on PR #9640: URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894949364 ## CI report: * 0c7300dbe529e40a4ce261032787843e241f2b45 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21966) * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
hudi-bot commented on PR #10497: URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894943957 ## CI report: * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21988)
[jira] [Updated] (HUDI-7302) Consistent Hashing row writer support sorting
[ https://issues.apache.org/jira/browse/HUDI-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7302: - Labels: pull-request-available (was: ) > Consistent Hashing row writer support sorting > - > > Key: HUDI-7302 > URL: https://issues.apache.org/jira/browse/HUDI-7302 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Qijun Fu >Priority: Major > Labels: pull-request-available > > Consistent Hashing row writer support sorting -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]
stream2000 opened a new pull request, #10515: URL: https://github.com/apache/hudi/pull/10515 ### Change Logs Consistent Hashing row writer support sorting ### Impact now consistent hashing clustering support sorting ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ NONE ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7302) Consistent Hashing row writer support sorting
Qijun Fu created HUDI-7302: -- Summary: Consistent Hashing row writer support sorting Key: HUDI-7302 URL: https://issues.apache.org/jira/browse/HUDI-7302 Project: Apache Hudi Issue Type: Improvement Reporter: Qijun Fu Consistent Hashing row writer support sorting
Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]
codope closed issue #9826: [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running URL: https://github.com/apache/hudi/issues/9826
Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]
ad1happy2go commented on issue #9826: URL: https://github.com/apache/hudi/issues/9826#issuecomment-1894915016 Closing this issue as 0.14.1 is released. Please reopen in case you see this issue again @zyclove
Re: [I] [SUPPORT] Solution for synchronizing the entire database table in flink [hudi]
ad1happy2go commented on issue #9965: URL: https://github.com/apache/hudi/issues/9965#issuecomment-1894912026 @bajiaolong Closing this out. Please reopen or create a new one for further queries. Thanks.
Re: [I] UPSERTs are taking time [hudi]
ad1happy2go commented on issue #9976: URL: https://github.com/apache/hudi/issues/9976#issuecomment-1894910852 @darlatrade Did the suggestion work? Do you need any other help here?
Re: [I] [SUPPORT]Flink writes MOR table, both RO table and RT table read nothing by hive [hudi]
ad1happy2go commented on issue #10465: URL: https://github.com/apache/hudi/issues/10465#issuecomment-1894908381 @nicholasxu They are deleted as part of the cleaning process. We do need them for point in time queries.
Re: [PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]
hudi-bot commented on PR #10514: URL: https://github.com/apache/hudi/pull/10514#issuecomment-1894906031 ## CI report: * fb3087b8709a75b658f802b5c1d5fbcc7cfbbd65 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21992)
Re: [PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]
hudi-bot commented on PR #10514: URL: https://github.com/apache/hudi/pull/10514#issuecomment-1894900395 ## CI report: * fb3087b8709a75b658f802b5c1d5fbcc7cfbbd65 UNKNOWN
[PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]
linliu-code opened a new pull request, #10514: URL: https://github.com/apache/hudi/pull/10514 Reverts apache/hudi#10481
Re: [PR] [MINOR] Handle parsing of all zero timestamps with MDT suffixes. [hudi]
linliu-code commented on PR #10481: URL: https://github.com/apache/hudi/pull/10481#issuecomment-1894889506 @prashantwason, the test failure caused by this change keeps failing the master branch. Please revert this PR and fix it before resubmitting it.
[jira] [Assigned] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF
[ https://issues.apache.org/jira/browse/HUDI-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7301: - Assignee: Vinaykumar Bhat > Update hudi docs/websites with documentation for the new spark TVF > -- > > Key: HUDI-7301 > URL: https://issues.apache.org/jira/browse/HUDI-7301 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > Hudi documentation and website needs to be updated to reflect the support for > new spark-sql related table-valued-functions
[jira] [Created] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF
Vinaykumar Bhat created HUDI-7301: - Summary: Update hudi docs/websites with documentation for the new spark TVF Key: HUDI-7301 URL: https://issues.apache.org/jira/browse/HUDI-7301 Project: Apache Hudi Issue Type: Task Reporter: Vinaykumar Bhat Hudi documentation and website needs to be updated to reflect the support for new spark-sql related table-valued-functions
Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]
danny0405 commented on PR #10389: URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894865885 Looks good to me, just take care of the test failures.
Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]
hudi-bot commented on PR #10389: URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894861662 ## CI report: * 248df7c04d611c5f521f309732aa21351161fa8b UNKNOWN * 0bd0b5188c73636a79d9d2b43a452497afa137f7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21989)
Re: [PR] [HUDI-7270] Support schema evolution by Flink SQL using HoodieCatalog [hudi]
danny0405 commented on code in PR #10494: URL: https://github.com/apache/hudi/pull/10494#discussion_r1454426428 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalogUtil.java: ## @@ -172,4 +196,94 @@ public static List getOrderedPartitionValues( return values; } + + protected static void alterTable( Review Comment: Can we add some documentation for this method?
Re: [PR] [HUDI-7270] Support schema evolution by Flink SQL using HoodieCatalog [hudi]
danny0405 commented on code in PR #10494: URL: https://github.com/apache/hudi/pull/10494#discussion_r1454427215 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalogUtil.java: ## @@ -172,4 +196,94 @@ public static List getOrderedPartitionValues( return values; } + + protected static void alterTable( + AbstractCatalog catalog, + ObjectPath tablePath, + CatalogBaseTable newCatalogTable, + List tableChanges, + boolean ignoreIfNotExists, + org.apache.hadoop.conf.Configuration hadoopConf, + BiFunction inferTablePathFunc, + BiConsumer postAlterTableFunc) throws TableNotExistException, CatalogException { +checkNotNull(tablePath, "Table path cannot be null"); +checkNotNull(newCatalogTable, "New catalog table cannot be null"); + +if (!isUpdatePermissible(catalog, tablePath, newCatalogTable, ignoreIfNotExists)) { + return; +} +if (!tableChanges.isEmpty()) { + CatalogBaseTable oldTable = catalog.getTable(tablePath); + HoodieFlinkWriteClient writeClient = createWriteClient(tablePath, oldTable, hadoopConf, inferTablePathFunc); + Pair pair = writeClient.getInternalSchemaAndMetaClient(); + InternalSchema oldSchema = pair.getLeft(); + Function convertFunc = (LogicalType logicalType) -> AvroInternalSchemaConverter.convertToField(AvroSchemaConverter.convertToSchema(logicalType)); + InternalSchema newSchema = Utils.applyTableChange(oldSchema, tableChanges, convertFunc); + if (!oldSchema.equals(newSchema)) { +writeClient.setOperationType(WriteOperationType.ALTER_SCHEMA); +writeClient.commitTableChange(newSchema, pair.getRight()); + } +} +postAlterTableFunc.accept(tablePath, newCatalogTable); + } + + protected static HoodieFlinkWriteClient createWriteClient( + ObjectPath tablePath, + CatalogBaseTable table, + org.apache.hadoop.conf.Configuration hadoopConf, + BiFunction inferTablePathFunc) { +Map options = table.getOptions(); +String tablePathStr = inferTablePathFunc.apply(tablePath, table); +return createWriteClient(options, tablePathStr, 
tablePath, hadoopConf); + } + + protected static HoodieFlinkWriteClient createWriteClient( + Map options, + String tablePathStr, + ObjectPath tablePath, + org.apache.hadoop.conf.Configuration hadoopConf) { +// enable auto-commit though ~ +options.put(HoodieWriteConfig.AUTO_COMMIT_ENABLE.key(), "true"); Review Comment: Not sure whether this is needed for all the scenarios?
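The control flow of the quoted `alterTable` method can be sketched in Python as follows. This is only an illustration under stated assumptions: schemas are modelled as plain comparable values, and `apply_change`, `commit`, and `post_alter` are hypothetical callables standing in for `Utils.applyTableChange`, `commitTableChange`, and `postAlterTableFunc`. The null checks and the `isUpdatePermissible` guard from the diff are left out.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # schema type, only needs to support equality
C = TypeVar("C")  # table-change type


def alter_table(
    old_schema: S,
    changes: Iterable[C],
    apply_change: Callable[[S, C], S],
    commit: Callable[[S], None],
    post_alter: Callable[[], None],
) -> bool:
    """Apply changes, commit only if the schema actually differs, then run the post hook.

    Returns True when a schema commit happened. Hypothetical model, not Hudi code.
    """
    committed = False
    change_list = list(changes)
    if change_list:
        new_schema = old_schema
        for change in change_list:
            new_schema = apply_change(new_schema, change)
        # Mirrors the `!oldSchema.equals(newSchema)` guard in the quoted diff.
        if new_schema != old_schema:
            commit(new_schema)
            committed = True
    # The post-alter hook runs whether or not a commit happened, as in the diff.
    post_alter()
    return committed
```

The point of the guard is that a set of changes that leaves the schema identical does not produce a commit, while the catalog-side hook still runs.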
Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]
hudi-bot commented on PR #10389: URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894856206 ## CI report: * 248df7c04d611c5f521f309732aa21351161fa8b UNKNOWN * 9aa9291d5b52c9801420505a91e60c92bf8439a2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21807) * 0bd0b5188c73636a79d9d2b43a452497afa137f7 UNKNOWN
Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894856440 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
hudi-bot commented on PR #10497: URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894856364 ## CI report: * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963) * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21988)
[jira] [Closed] (HUDI-7294) Add TVF to query hudi metadata
[ https://issues.apache.org/jira/browse/HUDI-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7294. - Fix Version/s: 1.0.0 Resolution: Done > Add TVF to query hudi metadata > -- > > Key: HUDI-7294 > URL: https://issues.apache.org/jira/browse/HUDI-7294 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Having a table valued function to query hudi metadata for a given table > through spark-sql will help in debugging
(hudi) branch master updated: [HUDI-7294] TVF to query hudi metadata (#10491)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 108a885b4db [HUDI-7294] TVF to query hudi metadata (#10491) 108a885b4db is described below commit 108a885b4db62f08d30ede47805b8b44c35ab1e6 Author: bhat-vinay <152183592+bhat-vi...@users.noreply.github.com> AuthorDate: Wed Jan 17 08:21:07 2024 +0530 [HUDI-7294] TVF to query hudi metadata (#10491) Adds a TVF function to query hudi metadata through spark-sql. Since the metadata is already a MOR table, it simply creates a 'snapshot' on a MOR relation. Could not find any way to format (or filter) the RDD generated by the MOR snapshot relation. Uploading the PR to get some feedback. Co-authored-by: Vinaykumar Bhat --- .../sql/hudi/TestHoodieTableValuedFunction.scala | 68 ++ .../logcal/HoodieMetadataTableValuedFunction.scala | 46 +++ .../hudi/analysis/HoodieSpark32PlusAnalysis.scala | 17 +- .../sql/hudi/analysis/TableValuedFunctions.scala | 7 ++- 4 files changed, 136 insertions(+), 2 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala index 867e83c301e..bdf512d3451 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala @@ -21,6 +21,8 @@ import org.apache.hudi.DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION import org.apache.hudi.HoodieSparkUtils import org.apache.spark.sql.functions.{col, from_json} +import scala.collection.Seq + class TestHoodieTableValuedFunction extends HoodieSparkSqlTestBase { test(s"Test hudi_query Table-Valued Function") { @@ -558,4 
+560,70 @@ class TestHoodieTableValuedFunction extends HoodieSparkSqlTestBase { } } } + + test(s"Test hudi_metadata Table-Valued Function") { +if (HoodieSparkUtils.gteqSpark3_2) { + withTempDir { tmp => +Seq("cow", "mor").foreach { tableType => + val tableName = generateTableName + val identifier = tableName + spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert") + spark.sql( +s""" + |create table $tableName ( + | id int, + | name string, + | ts long, + | price int + |) using hudi + |partitioned by (price) + |tblproperties ( + | type = '$tableType', + | primaryKey = 'id', + | preCombineField = 'ts', + | hoodie.datasource.write.recordkey.field = 'id', + | hoodie.metadata.record.index.enable = 'true', + | hoodie.metadata.index.column.stats.enable = 'true', + | hoodie.metadata.index.column.stats.column.list = 'price' + |) + |location '${tmp.getCanonicalPath}/$tableName' + |""".stripMargin + ) + + spark.sql( +s""" + | insert into $tableName + | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30) + | """.stripMargin + ) + + val result2DF = spark.sql( +s"select type, key, filesystemmetadata from hudi_metadata('$identifier') where type=1" + ) + assert(result2DF.count() == 1) + + val result3DF = spark.sql( +s"select type, key, filesystemmetadata from hudi_metadata('$identifier') where type=2" + ) + assert(result3DF.count() == 3) + + val result4DF = spark.sql( +s"select type, key, ColumnStatsMetadata from hudi_metadata('$identifier') where type=3" + ) + assert(result4DF.count() == 3) + + val result5DF = spark.sql( +s"select type, key, recordIndexMetadata from hudi_metadata('$identifier') where type=5" + ) + assert(result5DF.count() == 3) + + val result6DF = spark.sql( +s"select type, key, BloomFilterMetadata from hudi_metadata('$identifier') where BloomFilterMetadata is not null" + ) + assert(result6DF.count() == 0) +} + } +} +spark.sessionState.conf.unsetConf(SPARK_SQL_INSERT_INTO_OPERATION.key) + } } diff --git 
a/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logcal/HoodieMetadataTableValuedFunction.scala b/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logcal/HoodieMetadataTableVa
Re: [PR] [HUDI-7294] TVF to query hudi metadata [hudi]
codope merged PR #10491: URL: https://github.com/apache/hudi/pull/10491
(hudi) branch asf-site updated: [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new a702ced7f0f [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510) a702ced7f0f is described below commit a702ced7f0f4e0e058ae0f0eaff28ec278f62fbf Author: Dipankar Mazumdar <103004148+dipankarmazum...@users.noreply.github.com> AuthorDate: Tue Jan 16 21:46:06 2024 -0500 [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510) * remaining diagrams * fixed issue with rollbacks page - Co-authored-by: Dipankar Mazumdar --- website/docs/clustering.md| 6 +++--- website/docs/rollbacks.md | 4 ++-- website/docs/table_types.md | 4 ++-- website/static/assets/images/COW_new.png | Bin 0 -> 1034864 bytes website/static/assets/images/MOR_new.png | Bin 0 -> 1342587 bytes .../assets/images/blog/clustering/clustering1_new.png | Bin 0 -> 1420549 bytes .../assets/images/blog/clustering/clustering2_new.png | Bin 0 -> 302821 bytes .../assets/images/blog/clustering/clustering_3.png| Bin 0 -> 513090 bytes .../assets/images/blog/rollbacks/Rollback_1.png | Bin 0 -> 311672 bytes .../assets/images/blog/rollbacks/rollback2_new.png| Bin 0 -> 569899 bytes 10 files changed, 7 insertions(+), 7 deletions(-) diff --git a/website/docs/clustering.md b/website/docs/clustering.md index 2feab1902ac..7749292b1cf 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -59,7 +59,7 @@ Clustering Service builds on Hudi’s MVCC based design to allow for writers to NOTE: Clustering can only be scheduled for tables / partitions not receiving any concurrent updates. In the future, concurrent updates use-case will be supported as well. 
-![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png) +![Clustering example](/assets/images/blog/clustering/clustering1_new.png) _Figure: Illustrating query performance improvements by clustering_ ## Clustering Usecases @@ -71,7 +71,7 @@ such small files could lead to higher query latency. From our experience support few users who are using Hudi just for small file handling capabilities. So, you could employ clustering to batch a lot of such small files into larger ones. -![Batching small files](/assets/images/clustering_small_files.gif) +![Batching small files](/assets/images/blog/clustering/clustering2_new.png) ### Cluster by sort key @@ -80,7 +80,7 @@ arrival time, while query predicates do not sit well with it. With clustering, y based on query predicates and so, your data skipping will be very efficient and your query can ignore scanning a lot of unnecessary data. -![Batching small files](/assets/images/clustering_sort.gif) +![Batching small files](/assets/images/blog/clustering/clustering_3.png) ## Clustering Strategies diff --git a/website/docs/rollbacks.md b/website/docs/rollbacks.md index 5a2ebf2a70b..c78b8f3b084 100644 --- a/website/docs/rollbacks.md +++ b/website/docs/rollbacks.md @@ -35,7 +35,7 @@ for any actions/commits that is not yet committed and that refers to partially f is triggered and all dirty data is cleaned up followed by cleaning up the commit instants from the timeline. -![An example illustration of single writer rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png) +![An example illustration of single writer rollbacks](/assets/images/blog/rollbacks/Rollback_1.png) _Figure 1: single writer with eager rollbacks_ @@ -63,7 +63,7 @@ information whether the writer that started the commit of interest is still maki the commit, the heartbeat file is deleted. 
Or if the write failed midway, the last modification time of the heartbeat file is no longer updated, so other writers can deduce the failed write after a period of time elapses. -![An example illustration of multi writer rollbacks](/assets/images/blog/rollbacks/multi_writer_rollback.png) +![An example illustration of multi writer rollbacks](/assets/images/blog/rollbacks/rollback2_new.png) _Figure 2: multi-writer with lazy cleaning of failed commits_ ## Related Resources diff --git a/website/docs/table_types.md b/website/docs/table_types.md index 28814d239e8..e280909a9f3 100644 --- a/website/docs/table_types.md +++ b/website/docs/table_types.md @@ -69,7 +69,7 @@ Following illustrates how this works conceptually, when data written into copy-o - + @@ -97,7 +97,7 @@ their columnar base file, to keep the query performance in check (larger delta l Following illustrates how the table works, and shows two types of queries - snapshot query and read optimized query. - + There are lot of interesting things happening in this example
Re: [PR] [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types [hudi]
danny0405 merged PR #10510: URL: https://github.com/apache/hudi/pull/10510
Re: [PR] [HUDI-7299] BucketIndex table should forbit append mode [hudi]
danny0405 commented on code in PR #10505: URL: https://github.com/apache/hudi/pull/10505#discussion_r1454399288

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java: ##
@@ -111,7 +111,7 @@ public class Pipelines {
    */
   public static DataStreamSink bulkInsert(Configuration conf, RowType rowType, DataStream dataStream) {
     WriteOperatorFactory operatorFactory = BulkInsertWriteOperator.getFactory(conf, rowType);
-    if (OptionsResolver.isBucketIndexType(conf)) {
+    if (!OptionsResolver.isAppendMode(conf) && OptionsResolver.isBucketIndexType(conf)) {

Review Comment: In `HoodieTableSink`, append mode takes first priority, which means an append-only table never puts the bucket index into effect.
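The precedence described in the review comment can be sketched in isolation. This is a minimal illustration, not Hudi code: the class, the helper methods and the `sketch.*` keys are hypothetical stand-ins for `OptionsResolver.isAppendMode` and `OptionsResolver.isBucketIndexType`, and the returned labels only name which branch would be taken.

```java
import java.util.HashMap;
import java.util.Map;

public class BulkInsertRoutingSketch {

  // Hypothetical stand-in for OptionsResolver.isAppendMode(conf).
  static boolean isAppendMode(Map<String, String> conf) {
    return Boolean.parseBoolean(conf.getOrDefault("sketch.append.mode", "false"));
  }

  // Hypothetical stand-in for OptionsResolver.isBucketIndexType(conf).
  static boolean isBucketIndexType(Map<String, String> conf) {
    return "BUCKET".equalsIgnoreCase(conf.getOrDefault("sketch.index.type", ""));
  }

  // Mirrors the guarded condition in the diff: the bucket-index branch
  // is taken only when the table is not in append mode.
  static String chooseBulkInsertPath(Map<String, String> conf) {
    if (!isAppendMode(conf) && isBucketIndexType(conf)) {
      return "bucket";
    }
    return "default";
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("sketch.index.type", "BUCKET");
    System.out.println(chooseBulkInsertPath(conf)); // bucket

    conf.put("sketch.append.mode", "true");
    System.out.println(chooseBulkInsertPath(conf)); // default
  }
}
```

With both settings present, append mode wins, which matches the priority the reviewer attributes to `HoodieTableSink`.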
Re: [I] [SUPPORT] Migration partitionned table with complex key generator to 0.14.1 leads to duplicates when recordkey length =1 [hudi]
danny0405 commented on issue #10508: URL: https://github.com/apache/hudi/issues/10508#issuecomment-1894845482 Yeah, this was a mistake: the change should not have been included in the 0.14.1 release. It is intended for 1.0.0.
Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]
danny0405 commented on issue #10503: URL: https://github.com/apache/hudi/issues/10503#issuecomment-1894844589 Are you using PySpark? This looks like a bug.
(hudi) branch master updated: [MINOR] Fix eager rollback mdt ut (#10506)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 163053408f2 [MINOR] Fix eager rollback mdt ut (#10506) 163053408f2 is described below commit 163053408f258c16085ce6bc7c11eccd2319a491 Author: KnightChess <981159...@qq.com> AuthorDate: Wed Jan 17 10:38:27 2024 +0800 [MINOR] Fix eager rollback mdt ut (#10506) Signed-off-by: wulingqi <981159...@qq.com> --- .../java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java index 42242fdfa32..a44d98c4f8b 100644 --- a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java +++ b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java @@ -1534,8 +1534,8 @@ public class TestJavaHoodieBackedMetadata extends TestHoodieMetadataBase { fileStatus.getPath().getName().equals(rollbackInstant.getFileName())).collect(Collectors.toList()); // ensure commit3's delta commit in MDT has last mod time > the actual rollback for previous failed commit i.e. commit2. -// if rollback wasn't eager, rollback's last mod time will be lower than the commit3'd delta commit last mod time. -assertTrue(commit3Files.get(0).getModificationTime() > rollbackFiles.get(0).getModificationTime()); +// if rollback wasn't eager, rollback's last mod time will be not larger than the commit3'd delta commit last mod time. +assertTrue(commit3Files.get(0).getModificationTime() >= rollbackFiles.get(0).getModificationTime()); client.close(); }
Re: [PR] [MINOR] fix eager rollback mdt ut [hudi]
danny0405 merged PR #10506: URL: https://github.com/apache/hudi/pull/10506
Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]
linliu-code commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894842428 @hudi-bot run azure
Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]
danny0405 commented on issue #10456: URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894838453 Yeah, try to reduce the number of file groups per commit, because each file group has an in-memory buffer that is held before being flushed to disk.
Re: [PR] [HUDI-7001] ComplexAvroKeyGenerator should represent single record key as the value string without composing the key field name [hudi]
danny0405 commented on code in PR #9936: URL: https://github.com/apache/hudi/pull/9936#discussion_r1454379931

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/RowDataKeyGen.java: ##
@@ -99,7 +99,7 @@ protected RowDataKeyGen(
       this.recordKeyProjection = null;
     } else {
       this.recordKeyFields = recordKeys.get().split(",");
-      if (this.recordKeyFields.length == 1) {
+      if (this.recordKeyFields.length == 1 && this.partitionPathFields.length == 1) {

Review Comment: Are you using 0.14.1? 0.14.0 should not include this commit.
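The PR title describes representing a single record key as the bare value instead of composing it with the field name. A minimal sketch of why switching between the two representations can surface as duplicates is shown here. It assumes composite keys of the form `field:value` joined by commas; the class and method names are hypothetical and this is not Hudi's key generator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class RecordKeySketch {

  // Composite representation (assumed form): "field1:value1,field2:value2".
  static String compositeKey(Map<String, String> keyFields) {
    return keyFields.entrySet().stream()
        .map(e -> e.getKey() + ":" + e.getValue())
        .collect(Collectors.joining(","));
  }

  // When bareSingleKey is true and there is exactly one key field,
  // the key is the value alone; otherwise the composite form is used.
  static String recordKey(Map<String, String> keyFields, boolean bareSingleKey) {
    if (bareSingleKey && keyFields.size() == 1) {
      return keyFields.values().iterator().next();
    }
    return compositeKey(keyFields);
  }

  public static void main(String[] args) {
    Map<String, String> single = new LinkedHashMap<>();
    single.put("id", "42");

    String before = recordKey(single, false); // "id:42"
    String after = recordKey(single, true);   // "42"

    // The same logical record yields two different key strings.
    System.out.println(before + " vs " + after + " -> same key? " + before.equals(after));
  }
}
```

If rows written under one representation are later upserted under the other, the key strings no longer match, so an upsert would treat the incoming row as a new record rather than an update.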
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
paul8263 commented on code in PR #10497: URL: https://github.com/apache/hudi/pull/10497#discussion_r1454374435

## hudi-flink-datasource/hudi-flink1.14.x/src/main/java/org/apache/hudi/table/format/cow/ParquetSplitReaderUtil.java: ##
@@ -460,59 +460,59 @@ private static WritableColumnVector createWritableColumnVector(
       case BOOLEAN:
         checkArgument(
             typeName == PrimitiveType.PrimitiveTypeName.BOOLEAN,
-            "Unexpected type: %s", typeName);
+            "Unexpected type exception. Primitive type: %s. Field type: %s.", typeName, fieldType.getTypeRoot().name());

Review Comment: I extracted it into a static method, so the error-message construction code is not duplicated many times.
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
paul8263 commented on code in PR #10497: URL: https://github.com/apache/hudi/pull/10497#discussion_r1454372097

## hudi-flink-datasource/hudi-flink1.14.x/src/main/java/org/apache/hudi/table/format/cow/vector/reader/ParquetColumnarRowSplitReader.java: ##
@@ -218,11 +218,17 @@ private WritableColumnVector[] createWritableVectors() {
     List types = requestedSchema.getFields();
     List descriptors = requestedSchema.getColumns();
     for (int i = 0; i < requestedTypes.length; i++) {
-      columns[i] = createWritableColumnVector(
-          batchSize,
-          requestedTypes[i],
-          types.get(i),
-          descriptors);
+      String fieldName = requestedSchema.getFieldName(i);

Review Comment: Hi @danny0405, correct. It should be moved into the catch block, since it is only needed when an exception is thrown.
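The point agreed in this thread, looking up the field name only on the failure path, can be sketched as follows. This is an illustration under assumptions: `createVector` is a hypothetical stand-in for `createWritableColumnVector`, and plain strings replace the Parquet and Flink types.

```java
import java.util.Arrays;
import java.util.List;

public class LazyErrorContextSketch {

  // Hypothetical stand-in for createWritableColumnVector:
  // rejects one type name so that a failure can be triggered.
  static String createVector(String type) {
    if ("UNSUPPORTED".equals(type)) {
      throw new IllegalArgumentException("Unexpected type: " + type);
    }
    return "vector<" + type + ">";
  }

  static String[] createVectors(List<String> fieldNames, List<String> types) {
    String[] columns = new String[types.size()];
    for (int i = 0; i < types.size(); i++) {
      try {
        columns[i] = createVector(types.get(i));
      } catch (IllegalArgumentException e) {
        // The field name is resolved only here, when it is actually needed.
        String fieldName = fieldNames.get(i);
        throw new IllegalArgumentException(
            "Failed to create column vector for field '" + fieldName + "': " + e.getMessage(), e);
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(
        createVectors(Arrays.asList("id", "ts"), Arrays.asList("INT", "BIGINT"))));
    try {
      createVectors(Arrays.asList("id", "ts"), Arrays.asList("INT", "UNSUPPORTED"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The successful path does no extra work per column, while the rethrown exception still names the offending field and keeps the original exception as its cause.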
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
hudi-bot commented on PR #10497: URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894823533 ## CI report: * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963) * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 UNKNOWN
Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]
hudi-bot commented on PR #10497: URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894817105 ## CI report: * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963) * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN
Re: [PR] [HUDI-6902] Fix a unit test [hudi]
hudi-bot commented on PR #10513: URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894809514 ## CI report: * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21986)
Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]
yihua merged PR #10199: URL: https://github.com/apache/hudi/pull/10199
(hudi) branch master updated: [HUDI-7300] Merge schema in ParuqetDFSSource (#10199)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new ca5d4685a00 [HUDI-7300] Merge schema in ParuqetDFSSource (#10199) ca5d4685a00 is described below commit ca5d4685a002a3b3da917f6b195e27dcb20d7316 Author: Rohit Mittapalli AuthorDate: Tue Jan 16 17:52:07 2024 -0800 [HUDI-7300] Merge schema in ParuqetDFSSource (#10199) --- .../utilities/config/ParquetDFSSourceConfig.java | 49 ++ .../hudi/utilities/sources/ParquetDFSSource.java | 6 ++- 2 files changed, 54 insertions(+), 1 deletion(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java new file mode 100644 index 000..b3bf5678baf --- /dev/null +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.utilities.config; + +import org.apache.hudi.common.config.ConfigClassProperty; +import org.apache.hudi.common.config.ConfigGroups; +import org.apache.hudi.common.config.ConfigProperty; +import org.apache.hudi.common.config.HoodieConfig; + +import javax.annotation.concurrent.Immutable; + +import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX; +import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX; + +/** + * Parquet DFS Source Configs + */ +@Immutable +@ConfigClassProperty(name = "Parquet DFS Source Configs", +groupName = ConfigGroups.Names.HUDI_STREAMER, +subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE, +description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.") +public class ParquetDFSSourceConfig extends HoodieConfig { + + public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty + .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.merge_schema.enable") + .defaultValue(false) + .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "source.parquet.dfs.merge_schema.enable") + .markAdvanced() + .sinceVersion("1.0.0") + .withDocumentation("Merge schema across parquet files within a single write"); +} diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java index a56a878f1fe..a3ee555ec5a 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java @@ -21,6 +21,7 @@ package org.apache.hudi.utilities.sources; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.util.Option; import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.utilities.config.ParquetDFSSourceConfig; import 
org.apache.hudi.utilities.schema.SchemaProvider; import org.apache.hudi.utilities.sources.helpers.DFSPathSelector; @@ -29,6 +30,8 @@ import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; +import static org.apache.hudi.common.util.ConfigUtils.getBooleanWithAltKeys; + /** * DFS Source that reads parquet data. */ @@ -52,6 +55,7 @@ public class ParquetDFSSource extends RowSource { } private Dataset fromFiles(String pathStr) { -return sparkSession.read().parquet(pathStr.split(",")); +boolean mergeSchemaOption = getBooleanWithAltKeys(this.props, ParquetDFSSourceConfig.PARQUET_DFS_MERGE_SCHEMA); +return sparkSession.read().option("mergeSchema", mergeSchemaOption).parquet(pathStr.split(",")); } }
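The new option is declared with a primary key and an alternative key, and `fromFiles` reads it through `getBooleanWithAltKeys`. The lookup order can be sketched with plain `java.util.Properties`. This is an illustration only: `lookupBoolean` is a hypothetical helper, not Hudi's implementation, and the two literal prefixes (`hoodie.streamer.` and `hoodie.deltastreamer.`) are assumed values for `STREAMER_CONFIG_PREFIX` and `DELTA_STREAMER_CONFIG_PREFIX`.

```java
import java.util.Properties;

public class AltKeyLookupSketch {

  // Assumed expansions of the prefixed keys declared in ParquetDFSSourceConfig.
  static final String KEY = "hoodie.streamer.source.parquet.dfs.merge_schema.enable";
  static final String ALT_KEY = "hoodie.deltastreamer.source.parquet.dfs.merge_schema.enable";

  // Hypothetical helper: primary key first, then the alternative key, then the default.
  static boolean lookupBoolean(Properties props, String key, String altKey, boolean defaultValue) {
    String value = props.getProperty(key);
    if (value == null) {
      value = props.getProperty(altKey);
    }
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(lookupBoolean(props, KEY, ALT_KEY, false)); // false

    props.setProperty(ALT_KEY, "true");
    System.out.println(lookupBoolean(props, KEY, ALT_KEY, false)); // true

    props.setProperty(KEY, "false");
    System.out.println(lookupBoolean(props, KEY, ALT_KEY, false)); // false
  }
}
```

In the commit, the resolved boolean is passed to Spark's Parquet reader as the `mergeSchema` option, so leaving the setting at its default of `false` keeps the previous read behavior.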
Re: [PR] [HUDI-6902] Fix a unit test [hudi]
hudi-bot commented on PR #10513: URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894774632 ## CI report: * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21986)
Re: [PR] [HUDI-6902] Fix a unit test [hudi]
hudi-bot commented on PR #10513: URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894767088 ## CI report: * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea UNKNOWN
Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894767056 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894760625 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
[PR] [HUDI-6902] Fix a unit test [hudi]
linliu-code opened a new pull request, #10513: URL: https://github.com/apache/hudi/pull/10513 ### Change Logs As title. ### Impact None. ### Risk level (write none, low medium or high below) None. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-6902] Run Azure tests on different agents [hudi]
hudi-bot commented on PR #10512: URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894713292 ## CI report: * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
Re: [I] [SUPPORT] Hudi Delete Partition on AWS Glue [hudi]
soumilshah1995 commented on issue #8894: URL: https://github.com/apache/hudi/issues/8894#issuecomment-1894700142 Hey buddy, it depends on how you have partitioned your tables. If the tables use Hive-style partitioning, state='Connecticut' should work. Let's connect on Slack for more details :D
[PR] [HUDI-6902] Run Azure tests on different agents [hudi]
linliu-code opened a new pull request, #10512: URL: https://github.com/apache/hudi/pull/10512 ### Change Logs Create an agent pool for each job. ### Impact Isolate each job. ### Risk level (write none, low medium or high below) None. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]
hudi-bot commented on PR #10199: URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894665261 ## CI report: * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878) * b8158aa597e89aae3e83bb650bd07847a3f28dd3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21983)
Re: [PR] [HUDI-7247]Spark truncate table supports concurrency [hudi]
bvaradar commented on code in PR #10390: URL: https://github.com/apache/hudi/pull/10390#discussion_r1454176491

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/TruncateHoodieTableCommand.scala: ##
@@ -68,7 +71,12 @@ case class TruncateHoodieTableCommand(
     val targetPath = new Path(basePath)
     val engineContext = new HoodieSparkEngineContext(sparkSession.sparkContext)
     val fs = FSUtils.getFs(basePath, sparkSession.sparkContext.hadoopConfiguration)
+    val hoodieWriteConfig = HoodieWriteConfig.newBuilder().withPath(basePath).withProps(properties).withEngineType(EngineType.SPARK)
+      .build()
+    val transactionManager = new TransactionManager(hoodieWriteConfig, fs)
+    transactionManager.beginTransaction(org.apache.hudi.common.util.Option.empty(), org.apache.hudi.common.util.Option.empty())
     FSUtils.deleteDir(engineContext, fs, targetPath, sparkSession.sparkContext.defaultParallelism)
+    transactionManager.endTransaction(org.apache.hudi.common.util.Option.empty())

Review Comment: +1 on using a replace commit. That makes the operation truly revertible and aligns it with other operations. @waywtdcc: Can you make this change?
[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas
[ https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7300: - Labels: pull-request-available (was: ) > Parquet DFS source should support merging schemas > - > > Key: HUDI-7300 > URL: https://issues.apache.org/jira/browse/HUDI-7300 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Rohit Mittapalli >Priority: Minor > Labels: pull-request-available > Original Estimate: 24h > Remaining Estimate: 24h > > We should surface the option to merge schema across the parquet files in a > single commit. when using ParquetDFSSource. > > When false the schema is randomly picked from a parquet file (current > behavior). When set to true the schema across a commit is merged. > > https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]
hudi-bot commented on PR #10199: URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894656511 ## CI report: * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878) * b8158aa597e89aae3e83bb650bd07847a3f28dd3 UNKNOWN
[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas
[ https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mittapalli updated HUDI-7300: --- Status: In Progress (was: Open) > Parquet DFS source should support merging schemas > - > > Key: HUDI-7300 > URL: https://issues.apache.org/jira/browse/HUDI-7300 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Rohit Mittapalli >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > We should surface the option to merge schema across the parquet files in a > single commit. when using ParquetDFSSource. > > When false the schema is randomly picked from a parquet file (current > behavior). When set to true the schema across a commit is merged. > > https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7300) Parquet DFS source should support merging schemas
Rohit Mittapalli created HUDI-7300:

Summary: Parquet DFS source should support merging schemas
Key: HUDI-7300
URL: https://issues.apache.org/jira/browse/HUDI-7300
Project: Apache Hudi
Issue Type: Improvement
Reporter: Rohit Mittapalli

We should surface the option to merge schemas across the Parquet files in a single commit when using ParquetDFSSource.

When false, the schema is picked at random from one Parquet file (current behavior). When set to true, the schemas across a commit are merged.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
Re: [PR] [MINOR] Handle parsing of all zero timestamps with MDT suffixes. [hudi]
bvaradar merged PR #10481: URL: https://github.com/apache/hudi/pull/10481
(hudi) branch master updated: [MINOR] Handle parsing of all zero timestamps with MDT suffixes. (#10481)
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 68e703e3a49 [MINOR] Handle parsing of all zero timestamps with MDT suffixes. (#10481)
68e703e3a49 is described below

commit 68e703e3a4987a1d9ec6e20fae0ad7436f77bd3c
Author: Prashant Wason
AuthorDate: Tue Jan 16 14:49:57 2024 -0800

    [MINOR] Handle parsing of all zero timestamps with MDT suffixes. (#10481)
---
 .../common/table/timeline/HoodieInstantTimeGenerator.java |  4 ++++
 .../common/table/timeline/TestHoodieActiveTimeline.java   | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
index 2e48e40820d..3fb9a0698b6 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
@@ -90,6 +90,10 @@ public class HoodieInstantTimeGenerator {
       LocalDateTime dt = LocalDateTime.parse(timestampInMillis, MILLIS_INSTANT_TIME_FORMATTER);
       return Date.from(dt.atZone(ZoneId.systemDefault()).toInstant());
     } catch (DateTimeParseException e) {
+      // MDT uses timestamps which add suffixes to the instant time. Hence, we are checking for all timestamps that start with all zeros.
+      if (timestamp.startsWith(HoodieTimeline.INIT_INSTANT_TS)) {
+        return new Date(0);
+      }
       throw new ParseException(e.getMessage(), e.getErrorIndex());
     }
   }
diff --git a/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java b/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
index ce0b5dad335..847d7d9e7b9 100755
--- a/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
@@ -609,6 +609,19 @@ public class TestHoodieActiveTimeline extends HoodieCommonTestHarness {
     System.out.println(defaultSecsGranularityDate.getTime());
   }
 
+  @Test
+  public void testAllZeroTimestampParsing() throws ParseException {
+    String allZeroTs = "00000000000000";
+    Date allZeroDate = HoodieActiveTimeline.parseDateFromInstantTime(allZeroTs);
+    assertEquals(allZeroDate, new Date(0), "Parsing of all zero timestamp should succeed");
+
+    // MDT uses timestamps which add suffixes to the instant time. These should also be parsable for all zero case.
+    for (int index = 0; index < 10; ++index) {
+      allZeroDate = HoodieActiveTimeline.parseDateFromInstantTime(allZeroTs + "00" + index);
+      assertEquals(allZeroDate, new Date(0), "Parsing of all zero timestamp should succeed");
+    }
+  }
+
   @Test
   public void testMetadataCompactionInstantDateParsing() throws ParseException {
     // default second granularity instant ID
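The fallback in the commit above can be tried outside Hudi with a minimal, self-contained sketch. Everything here is an assumption made for illustration: the class name `AllZeroInstantSketch` is hypothetical, the constant stands in for `HoodieTimeline.INIT_INSTANT_TS` and is assumed to be 14 zeros, and the formatter is only an approximation of `MILLIS_INSTANT_TIME_FORMATTER` (a 17-digit `yyyyMMddHHmmss` plus milliseconds layout). It is not the Hudi implementation.

```java
import java.text.ParseException;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Date;

final class AllZeroInstantSketch {
  // Assumption: stand-in for HoodieTimeline.INIT_INSTANT_TS, taken to be 14 zeros.
  static final String INIT_INSTANT_TS = "00000000000000";

  // Assumption: millisecond-granularity instants are 17 digits (yyyyMMddHHmmss + SSS).
  static final DateTimeFormatter MILLIS_FORMATTER = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  static Date parse(String timestamp) throws ParseException {
    try {
      LocalDateTime dt = LocalDateTime.parse(timestamp, MILLIS_FORMATTER);
      return Date.from(dt.atZone(ZoneId.systemDefault()).toInstant());
    } catch (DateTimeParseException e) {
      // All-zero instants (with or without a suffix) are not valid dates,
      // so they are mapped to the epoch instead of failing.
      if (timestamp.startsWith(INIT_INSTANT_TS)) {
        return new Date(0);
      }
      throw new ParseException(e.getMessage(), e.getErrorIndex());
    }
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(parse("00000000000000"));     // bare all-zero instant
    System.out.println(parse("00000000000000005"));  // all-zero instant with a suffix
    System.out.println(parse("20240116144957000"));  // ordinary instant
  }
}
```

The prefix check runs only after the regular parse has failed, so ordinary instants never take the fallback path.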
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454158271

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .defaultValue(true)

Review Comment:
fine by me! will set to false by default then
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454154841

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .defaultValue(true)

Review Comment:
![Screenshot 2024-01-16 at 4 38 21 PM](https://github.com/apache/hudi/assets/2701446/9c6730f8-e9f1-41ab-988c-f6242ec8e523)

did a quick check on the doc so it's default false. setting this true will introduce behavior changes. we should keep it BWC in pre 1.0 releases
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
yihua commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454147802

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")

Review Comment:
Avoid camelCase in the config naming. use `.enable_merge_schema` instead.

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .defaultValue(true)
+        .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .markAdvanced()

Review Comment:
add `sinceVersion("1.0.0")`
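To make the review points concrete (a snake_case key, an alternative key under the older prefix, and a backward-compatible default of false), here is a rough plain-Java sketch of how such a flag could be resolved. It deliberately avoids Hudi's `ConfigProperty` API. The class name, the two prefix values, and the exact key suffix `source.parquet.dfs.enable_merge_schema` are assumptions chosen for illustration, not names confirmed by the thread.

```java
import java.util.Properties;

final class MergeSchemaConfigSketch {
  // Assumptions: both prefix values and the snake_case suffix are illustrative only.
  static final String STREAMER_PREFIX = "hoodie.streamer.";
  static final String DELTA_STREAMER_PREFIX = "hoodie.deltastreamer.";
  static final String KEY_SUFFIX = "source.parquet.dfs.enable_merge_schema";

  // A false default keeps existing pipelines on their current behavior.
  static final boolean DEFAULT_VALUE = false;

  static boolean isMergeSchemaEnabled(Properties props) {
    // The primary key wins; the alternative key is consulted only when the primary is absent.
    String value = props.getProperty(STREAMER_PREFIX + KEY_SUFFIX);
    if (value == null) {
      value = props.getProperty(DELTA_STREAMER_PREFIX + KEY_SUFFIX);
    }
    return value == null ? DEFAULT_VALUE : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(isMergeSchemaEnabled(props)); // nothing set: falls back to the default
    props.setProperty(DELTA_STREAMER_PREFIX + KEY_SUFFIX, "true");
    System.out.println(isMergeSchemaEnabled(props)); // alternative key is honored
  }
}
```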
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
rohitmittapalli commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894619834

> @rohitmittapalli can you also file a jira and update the title with the jira id pls?

Requested a JIRA account unable to file until that gets approved
Re: [PR] [MINOR] Clean default Hadoop configuration values in tests [hudi]
vinothchandar merged PR #10495: URL: https://github.com/apache/hudi/pull/10495
(hudi) branch master updated: [MINOR] Clean default Hadoop configuration values in tests (#10495)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 32ade368d89 [MINOR] Clean default Hadoop configuration values in tests (#10495)
32ade368d89 is described below

commit 32ade368d899ede5c8e7854863945864604b5692
Author: Lin Liu <141371752+linliu-c...@users.noreply.github.com>
AuthorDate: Tue Jan 16 14:24:23 2024 -0800

    [MINOR] Clean default Hadoop configuration values in tests (#10495)

    * [MINOR] Clean default Hadoop configurations for SparkContext

    These default Hadoop configurations are not used in Hudi tests.

    * Consolidating the code into a helper class

    ---------

    Co-authored-by: vinoth chandar
---
 .../org/apache/hudi/testutils/HoodieClientTestUtils.java | 14 ++++++++++++++
 .../hudi/testutils/HoodieSparkClientTestHarness.java     |  9 ++++++---
 .../hudi/testutils/SparkClientFunctionalTestHarness.java |  1 +
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
index 991c615c35d..55619a2a24b 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
@@ -53,6 +53,7 @@ import org.apache.hadoop.hbase.io.hfile.CacheConfig;
 import org.apache.hadoop.hbase.io.hfile.HFile;
 import org.apache.hadoop.hbase.io.hfile.HFileScanner;
 import org.apache.spark.SparkConf;
+import org.apache.spark.SparkContext;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -61,6 +62,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.lang.reflect.Field;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
@@ -125,6 +127,18 @@ public class HoodieClientTestUtils {
     return SparkRDDReadClient.addHoodieSupport(sparkConf);
   }
 
+  public static void overrideSparkHadoopConfiguration(SparkContext sparkContext) {
+    try {
+      // Clean the default Hadoop configurations since in our Hudi tests they are not used.
+      Field hadoopConfigurationField = sparkContext.getClass().getDeclaredField("_hadoopConfiguration");
+      hadoopConfigurationField.setAccessible(true);
+      Configuration testHadoopConfig = new Configuration(false);
+      hadoopConfigurationField.set(sparkContext, testHadoopConfig);
+    } catch (NoSuchFieldException | IllegalAccessException e) {
+      LOG.warn(e.getMessage());
+    }
+  }
+
   private static HashMap getLatestFileIDsToFullPath(String basePath, HoodieTimeline commitTimeline,
       List commitsToReturn) throws IOException {
     HashMap fileIdToFullPath = new HashMap<>();
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
index 2a83baa018c..59cfcb4bb6d 100644
--- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
@@ -69,6 +69,8 @@ import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.LocalFileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.SparkContext;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.SQLContext;
@@ -191,11 +193,12 @@ public abstract class HoodieSparkClientTestHarness extends HoodieWriterClientTes
     }
 
     // Initialize a local spark env
-    jsc = new JavaSparkContext(HoodieClientTestUtils.getSparkConfForTest(appName + "#" + testMethodName));
+    SparkConf sc = HoodieClientTestUtils.getSparkConfForTest(appName + "#" + testMethodName);
+    SparkContext sparkContext = new SparkContext(sc);
+    HoodieClientTestUtils.overrideSparkHadoopConfiguration(sparkContext);
+    jsc = new JavaSparkContext(sparkContext);
     jsc.setLogLevel("ERROR");
-
     hadoopConf = jsc.hadoopConfiguration();
-
     sparkSession = SparkSession.builder()
         .withExtensions(JFunction.toScala(sparkSessionExtensions -> {
           sparkSessionExtensionsInjector.ifPresent(injector -> injector.accept(sparkSessionExtensions));
diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
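The reflection technique used by `overrideSparkHadoopConfiguration` can be shown in isolation with a small sketch that needs neither Spark nor Hadoop. `FakeContext` below is a hypothetical stand-in for `SparkContext`, and a plain `Map` stands in for Hadoop's `Configuration`; only the field name `_hadoopConfiguration` is taken from the diff above.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

final class ReflectionOverrideSketch {
  // Hypothetical stand-in for SparkContext: a private field pre-populated with defaults.
  static final class FakeContext {
    private Map<String, String> _hadoopConfiguration = new HashMap<>();

    FakeContext() {
      _hadoopConfiguration.put("fs.defaultFS", "file:///");
    }

    Map<String, String> hadoopConfiguration() {
      return _hadoopConfiguration;
    }
  }

  // Replaces the private field with an empty map; returns false if reflection fails.
  static boolean overrideConfiguration(FakeContext context) {
    try {
      Field field = context.getClass().getDeclaredField("_hadoopConfiguration");
      field.setAccessible(true);
      field.set(context, new HashMap<String, String>());
      return true;
    } catch (NoSuchFieldException | IllegalAccessException e) {
      System.err.println(e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    FakeContext context = new FakeContext();
    System.out.println(context.hadoopConfiguration()); // defaults still present
    overrideConfiguration(context);
    System.out.println(context.hadoopConfiguration()); // replaced with an empty map
  }
}
```

As in the commit, a reflection failure is logged rather than rethrown, which means a renamed field would silently leave the defaults in place; returning a boolean here is one way for a caller to notice.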
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454133265

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .defaultValue(true)

Review Comment:
I've set default to true as per @nsivabalan's request here: https://github.com/apache/hudi/pull/10199#discussion_r1408722685

Essentially the key difference is that the schema will be merged across all the parquet files in the commit, in the past the schema would be inherited by the first file in the commit. In my opinion, this should be the default case.
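The behavioral difference described in the comment above (schema taken from the first file versus schema merged across all files) can be illustrated with a toy model. This is only a sketch under stated assumptions: schemas are modeled as ordered maps from field name to type name, the class and method names are hypothetical, and conflicting types are simply rejected. It does not reproduce how Spark or Parquet actually merge schemas.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SchemaMergeSketch {
  // Builds an ordered schema from alternating name/type arguments.
  static Map<String, String> schema(String... nameTypePairs) {
    Map<String, String> fields = new LinkedHashMap<>();
    for (int i = 0; i + 1 < nameTypePairs.length; i += 2) {
      fields.put(nameTypePairs[i], nameTypePairs[i + 1]);
    }
    return fields;
  }

  // Previous behavior as described in the thread: the first file decides the schema.
  static Map<String, String> firstFileSchema(List<Map<String, String>> fileSchemas) {
    return new LinkedHashMap<>(fileSchemas.get(0));
  }

  // Merged behavior: the union of fields across all files, in first-seen order.
  static Map<String, String> mergedSchema(List<Map<String, String>> fileSchemas) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> fileSchema : fileSchemas) {
      for (Map.Entry<String, String> field : fileSchema.entrySet()) {
        String existing = merged.putIfAbsent(field.getKey(), field.getValue());
        if (existing != null && !existing.equals(field.getValue())) {
          throw new IllegalArgumentException("Conflicting types for field " + field.getKey());
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Map<String, String>> files = Arrays.asList(
        schema("id", "long", "name", "string"),
        schema("id", "long", "email", "string"));
    System.out.println(firstFileSchema(files).keySet()); // the email column is missing
    System.out.println(mergedSchema(files).keySet());    // the email column is included
  }
}
```

In this model, a pipeline that relied on the first-file schema would start seeing extra columns once merging is switched on, which is the compatibility concern raised in the review.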
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
xushiyan commented on PR #10199: URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894614464 @rohitmittapalli can you also file a jira and update the title with the jira id pls?
Re: [PR] Merge schema in ParuqetDFSSource [hudi]
xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454129825

## hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java: ##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+    public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+        .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+        .defaultValue(true)

Review Comment:
can you clarify by setting this default to true, what is the impact to existing pipelines that using this DFS source? should it be false by default to be compatible?
Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]
vinothchandar commented on code in PR #10492: URL: https://github.com/apache/hudi/pull/10492#discussion_r1454069459 ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionQuick.java: ## @@ -59,25 +59,34 @@ public void teardown() throws Exception { } protected static Stream testArgs() { +boolean fullTest = false; Stream.Builder b = Stream.builder(); -//only testing row-writer enabled for now -for (Boolean rowWriterEnable : new Boolean[] {true}) { - for (Boolean nullForDeletedCols : new Boolean[] {false, true}) { -for (Boolean useKafkaSource : new Boolean[] {false, true}) { - for (Boolean addFilegroups : new Boolean[] {false, true}) { -for (Boolean multiLogFiles : new Boolean[] {false, true}) { - for (Boolean shouldCluster : new Boolean[] {false, true}) { -for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) { - if (!multiLogFiles || tableType.equals("MERGE_ON_READ")) { -b.add(Arguments.of(tableType, shouldCluster, false, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols)); +if (fullTest) { + //only testing row-writer enabled for now + for (Boolean rowWriterEnable : new Boolean[] {true}) { +for (Boolean nullForDeletedCols : new Boolean[] {false, true}) { + for (Boolean useKafkaSource : new Boolean[] {false, true}) { +for (Boolean addFilegroups : new Boolean[] {false, true}) { + for (Boolean multiLogFiles : new Boolean[] {false, true}) { +for (Boolean shouldCluster : new Boolean[] {false, true}) { + for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) { +if (!multiLogFiles || tableType.equals("MERGE_ON_READ")) { + b.add(Arguments.of(tableType, shouldCluster, false, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols)); +} } } +b.add(Arguments.of("MERGE_ON_READ", false, true, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols)); } - b.add(Arguments.of("MERGE_ON_READ", false, true, 
rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols)); } } } } +} else { Review Comment: ``` String tableType = COW, MOR Boolean shouldCluster = true Boolean shouldCompact = true Boolean rowWriterEnable = true Boolean addFilegroups = true Boolean multiLogFiles = true Boolean useKafkaSource= false, true Boolean allowNullForDeletedCols=false,true ``` I wonder if we just do sth like this. with new file groups, multiple log files, alongside cluster and compaction, should be the more complex (superset) scenario. no? ## hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionQuick.java: ## @@ -97,19 +106,27 @@ protected static Stream testReorderedColumn() { } protected static Stream testParamsWithSchemaTransformer() { +boolean fullTest = false; Stream.Builder b = Stream.builder(); -for (Boolean useTransformer : new Boolean[] {false, true}) { - for (Boolean setSchema : new Boolean[] {false, true}) { -for (Boolean rowWriterEnable : new Boolean[] {true}) { - for (Boolean nullForDeletedCols : new Boolean[] {false, true}) { -for (Boolean useKafkaSource : new Boolean[] {false, true}) { - for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) { -b.add(Arguments.of(tableType, rowWriterEnable, useKafkaSource, nullForDeletedCols, useTransformer, setSchema)); +if (fullTest) { + for (Boolean useTransformer : new Boolean[] {false, true}) { +for (Boolean setSchema : new Boolean[] {false, true}) { + for (Boolean rowWriterEnable : new Boolean[] {true}) { +for (Boolean nullForDeletedCols : new Boolean[] {false, true}) { + for (Boolean useKafkaSource : new Boolean[] {false, true}) { +for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) { + b.add(Arguments.of(tableType, rowWriterEnable, useKafkaSource, nullForDeletedCols, useTransformer, setSchema)); +} } } } } } +} else
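The reduced matrix suggested in the first review comment above (vary only the table type, the Kafka source flag, and the null-for-deleted-columns flag, with everything else fixed to true) might be generated along these lines. This is a plain-Java sketch without JUnit; the class name is hypothetical, and the position of `shouldCompact` as the third argument is an assumption inferred from the `Arguments.of(...)` calls in the quoted diff.

```java
import java.util.ArrayList;
import java.util.List;

final class ReducedTestMatrixSketch {
  // Each row: tableType, shouldCluster, shouldCompact (assumed position), rowWriterEnable,
  // addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols.
  static List<Object[]> reducedArgs() {
    List<Object[]> args = new ArrayList<>();
    for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
      for (boolean useKafkaSource : new boolean[] {false, true}) {
        for (boolean nullForDeletedCols : new boolean[] {false, true}) {
          // Clustering, compaction, row writer, new file groups and multiple log files
          // are all switched on, as in the "superset" scenario from the review.
          args.add(new Object[] {tableType, true, true, true, true, true, useKafkaSource, nullForDeletedCols});
        }
      }
    }
    return args;
  }

  public static void main(String[] args) {
    for (Object[] row : reducedArgs()) {
      System.out.println(java.util.Arrays.toString(row));
    }
  }
}
```

Three two-valued parameters give 2 x 2 x 2 = 8 combinations. One caveat: the full matrix in the diff only enables `multiLogFiles` for `MERGE_ON_READ`, so the `COPY_ON_WRITE` rows here may need that flag turned off.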
(hudi) branch master updated (744f2a1b6c0 -> df6e351f31c)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 744f2a1b6c0 [HUDI-7286] Flink get hudi index type ignore case sensitive (#10476)
     add df6e351f31c [HUDI-6092] Set the timeout for the forked JVM (#10496)

No new revisions were added by this update.

Summary of changes:
 pom.xml | 1 +
 1 file changed, 1 insertion(+)
Re: [PR] [HUDI-6092] Set the timeout for the forked JVM for tests [hudi]
vinothchandar merged PR #10496: URL: https://github.com/apache/hudi/pull/10496
[I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]
bk-mz opened a new issue, #10511:
URL: https://github.com/apache/hudi/issues/10511

**Describe the problem you faced**

We encountered an issue with MOR table that utilizes metadata bloom filters and Parquet bloom filters, and has enabled statistics. When attempting to query data, the system does not seem to utilize these bloom filters effectively. Instead, all requests result in a full partition scan, regardless of the applied filters.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a MOR table and write data using both Parquet bloom filters and metadata bloom filters.
2. Attempt to query the data by applying a filter to one of the columns that participate in bloom filtering. Ensure that the filter narrows down the dataset size, making the bloom filters more likely to be effective.
3. Observe that the Spark SQL User Interface (UI) displays a full partition scan.
4. Compare the query latency time for the column with bloom filters (BF) to the latency time for the column without bloom filters (non-BF).

**Expected behavior**

The expected behavior is that querying the column with bloom filters (BF) should be significantly more efficient than querying the column without bloom filters (non-BF).

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.5.0 AWS EMR 7.0.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Table write hudi params:

```properties
hoodie.bloom.index.filter.type=DYNAMIC_V0
hoodie.bloom.index.prune.by.ranges=false
hoodie.bloom.index.use.metadata=true
hoodie.clean.async=true
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.compact.inline.max.delta.commits=5
hoodie.datasource.hive_sync.database=db_name
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.partition_fields=year,month,day,hour
hoodie.datasource.hive_sync.table=table_name
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.partitionpath.field=year,month,day,hour
hoodie.datasource.write.path=s3://s3_path/table
hoodie.datasource.write.precombine.field=date_updated_epoch
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.streaming.checkpoint.identifier=main_writer
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.enable.data.skipping=true
hoodie.index.type=BLOOM
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.bloom.filter.column.list=id,account_id
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.column.list=id,account_id
hoodie.metadata.index.column.stats.enable=true
hoodie.metricscompaction.log.blocks.on=true
hoodie.table.name=table_name
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.dynamodb.partition_key=table_name
hoodie.write.lock.dynamodb.region=us-east-1
hoodie.write.lock.dynamodb.table=hudi-lock
hoodie.write.lock.num_retries=30
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.wait_time_ms=3
hoodie.write.lock.wait_time_ms_between_retry=1
```

Hadoop parquet properties:

```properties
parquet.avro.write-old-list-structure=false
parquet.bloom.filter.enabled#account_id=true
parquet.bloom.filter.enabled#id=true
```

If I download the file from s3 and then use parquet cli, it will show that BF on column is actually used:

```
parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id1
Row group 0:
value account_id1 NOT exists.

parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id2
Row group 0:
value account_id2 maybe exists.
```

Read part:

```
$ spark-sql \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --jars=/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
  --conf spark.executor.cores=8 \
  --conf spark.executor.memory=27G \
  --conf spark.driver.cores=8 \
  --conf spark.driver.memory=27G
```

spark-sql (default)> select count(1) as cnt from table_with_bfs where year = 2024 and month = 1 and day = 5 and account_id = 'id1';
82
Time taken: 34.962 seconds, Fetch
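For readers unfamiliar with the "NOT exists" versus "maybe exists" wording in the parquet CLI output above, a minimal Bloom filter sketch shows where the asymmetry comes from. This is a generic textbook-style filter written for illustration: the class name, sizing, and hash functions are assumptions, and it is not Parquet's split-block Bloom filter nor Hudi's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class BloomFilterSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  BloomFilterSketch(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  void add(String value) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(value, i));
    }
  }

  // false means "definitely not present"; true only means "maybe present".
  boolean mightContain(String value) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(value, i))) {
        return false;
      }
    }
    return true;
  }

  // Double hashing: derive the i-th bit position from two base hashes.
  private int index(String value, int i) {
    int h1 = value.hashCode();
    int h2 = fnv1a(value);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  private static int fnv1a(String value) {
    int hash = 0x811C9DC5;
    for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
      hash ^= (b & 0xff);
      hash *= 16777619;
    }
    return hash;
  }

  public static void main(String[] args) {
    BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
    filter.add("account_id2");
    System.out.println(filter.mightContain("account_id2")); // always true for an added value
    System.out.println(filter.mightContain("account_id1")); // usually false, but may be a false positive
  }
}
```

A filter like this can only let a reader skip a row group or file when it answers "definitely not present", so any speedup depends on the query engine actually consulting the filter for the predicate in question.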
Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]
linliu-code commented on PR #10492: URL: https://github.com/apache/hudi/pull/10492#issuecomment-1894464626 @jonvex, when is "fullTest" set to "true"?