date:20240116

Re: [PR] [HUDI-7303] Fix date field type unexpectedly convert to Long when usi… [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10517:
URL: https://github.com/apache/hudi/pull/10517#issuecomment-1895276998

   
   ## CI report:
   
   * 513d914fa72c497458c834d0b33962996b3d3e03 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10516:
URL: https://github.com/apache/hudi/pull/10516#issuecomment-1895276911

   
   ## CI report:
   
   * 8e44409db8f731627be1dbb55b7594bd94500e2f UNKNOWN
   
   

Re: [I] [SUPPORT] Flink write to COW Hudi table,hive aggregate query results has duplicate data but select * did not [hudi]

2024-01-16 Thread via GitHub



xicm commented on issue #10486:
URL: https://github.com/apache/hudi/issues/10486#issuecomment-1895237009

   This seems to be a bug.



[jira] [Updated] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages

2024-01-16 Thread xy (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xy updated HUDI-7304:
-
Attachment: spark_metrics_messages.jpg

> Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level  avoid 
> large mertric messages
> -
>
> Key: HUDI-7304
> URL: https://issues.apache.org/jira/browse/HUDI-7304
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Attachments: spark_metrics_messages.jpg
>
>
> Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level  avoid 
> large mertric messages



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-7303) Date field type unexpectedly convert to Long when using date comparison operator

2024-01-16 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7303:
-
Labels: pull-request-available  (was: )

> Date field type unexpectedly convert to Long when using date comparison 
> operator
> 
>
> Key: HUDI-7303
> URL: https://issues.apache.org/jira/browse/HUDI-7303
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: flink
>Affects Versions: 0.14.0, 0.14.1
> Environment: Flink 1.15.4 Hudi 0.14.0
> Flink 1.17.1 Hudi 0.14.0
> Flink 1.17.1 Hudi 0.14.1rc1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Given the table date_dim from TPCDS as an example:
> {code:java}
> CREATE TABLE date_dim (
>   d_date_sk int,
>   d_date_id varchar(16) NOT NULL,
>   d_date date,
>   d_month_seq int,
>   d_week_seq int,
>   d_quarter_seq int,
>   d_year int,
>   d_dow int,
>   d_moy int,
>   d_dom int,
>   d_qoy int,
>   d_fy_year int, 
>   d_fy_quarter_seq int,
>   d_fy_week_seq int,
>   d_day_name varchar(9),
>   d_quarter_name varchar(6),
>   d_holiday char(1),
>   d_weekend char(1),
>   d_following_holiday char(1),
>   d_first_dom int,
>   d_last_dom int,
>   d_same_day_ly int,
>   d_same_day_lq int,
>   d_current_day char(1),
>   d_current_week char(1),
>   d_current_month char(1),
>   d_current_quarter char(1),
>   d_current_year char(1)) with (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///table_path/date_dim',
>   'table.type' = 'COPY_ON_WRITE'); {code}
> When you execute the following select statement, an exception will be thrown:
> {code:java}
> select * from date_dim where d_date between cast('1999-02-22' as date) and 
> (cast('1999-02-22' as date) + INTERVAL '30' day);
> {code}
> The exception is:
> {code:java}
> java.lang.IllegalArgumentException: FilterPredicate column: d_date's declared 
> type (java.lang.Long) does not match the schema found in file metadata. 
> Column d_date is of type: INT32
> Valid types for this column are: [class java.lang.Integer]
>   at 
> org.apache.parquet.filter2.predicate.ValidTypeMap.assertTypeValid(ValidTypeMap.java:125)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:179)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:113)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.Operators$GtEq.accept(Operators.java:246)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:119)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:306) 
> ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:67)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:142)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:153)
>  ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
>   at 
> org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:78)
>  ~[hudi-flink1.17-bund

[PR] [HUDI-7303] Fix date field type unexpectedly convert to Long when usi… [hudi]

2024-01-16 Thread via GitHub



paul8263 opened a new pull request, #10517:
URL: https://github.com/apache/hudi/pull/10517

   …ng date comparison operator.
   
   ### Change Logs
   
   When the between, less than (or less than or equal), or greater than (or greater 
than or equal) operators are used on a field of type date, the date value is 
unexpectedly converted to Long, which is incompatible with its primitive type INT32.
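
For illustration only (this is not code from the PR), a minimal sketch of why INT32 is the expected type: Parquet stores a DATE as the number of days since 1970-01-01 in an INT32 column, so the bounds of a date comparison fit in a 32-bit integer and should be pushed down as such rather than as Long. The helper names `to_epoch_days` and `between_bounds` are hypothetical.

```python
from datetime import date
from typing import Tuple

EPOCH = date(1970, 1, 1)
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_epoch_days(d: date) -> int:
    """Return days since 1970-01-01, the value Parquet stores for a DATE."""
    days = (d - EPOCH).days
    # Any date representable by datetime.date fits in INT32; the check
    # documents the expectation instead of silently widening to 64 bits.
    if not INT32_MIN <= days <= INT32_MAX:
        raise OverflowError(f"{d} does not fit in INT32 epoch days")
    return days


def between_bounds(start: date, span_days: int) -> Tuple[int, int]:
    """Bounds for `col BETWEEN start AND start + span_days`, as epoch days."""
    lower = to_epoch_days(start)
    return lower, lower + span_days


# Bounds for the query in HUDI-7303: 1999-02-22 plus an interval of 30 days.
lower, upper = between_bounds(date(1999, 2, 22), 30)
```

Both bounds stay well inside the INT32 range, which is why a predicate typed as `java.lang.Long` is rejected by the Parquet schema validator for this column.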
   
   ### Impact
   
   No impact.
   
   ### Risk level (write none, low medium or high below)
   
   Low risk level.
   
   ### Documentation Update
   
   No need to update the documentation.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Updated] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages

2024-01-16 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7304:
-
Labels: pull-request-available  (was: )

> Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level  avoid 
> large mertric messages
> -
>
> Key: HUDI-7304
> URL: https://issues.apache.org/jira/browse/HUDI-7304
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
>
> Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level  avoid 
> large mertric messages




Re: [PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]

2024-01-16 Thread via GitHub



xuzifu666 commented on PR #10516:
URL: https://github.com/apache/hudi/pull/10516#issuecomment-1895199759

   cc @danny0405 



[PR] [HUDI-7304] Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages [hudi]

2024-01-16 Thread via GitHub



xuzifu666 opened a new pull request, #10516:
URL: https://github.com/apache/hudi/pull/10516

   
   ### Change Logs
   
   DataSourceInternalWriterHelper::onDataWriterCommit prints a large 
number of commit details that users do not need and that clutter the logs, so change 
the log level to debug.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Created] (HUDI-7304) Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level avoid large mertric messages

2024-01-16 Thread xy (Jira)

xy created HUDI-7304:


 Summary: Change DataSourceInternalWriterHelper::onDataWriterCommit 
LOG level  avoid large mertric messages
 Key: HUDI-7304
 URL: https://issues.apache.org/jira/browse/HUDI-7304
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: xy
Assignee: xy


Change DataSourceInternalWriterHelper::onDataWriterCommit LOG level  avoid 
large mertric messages




Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #9918:
URL: https://github.com/apache/hudi/issues/9918#issuecomment-1895063366

   @victorxiang30 @Armelabdelkbir @watermelon12138 Can you provide the schema 
to help me reproduce this? 
   
   If it has complex data types, can you try setting the Spark config 
spark.hadoop.parquet.avro.write-old-list-structure to false? 
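
A minimal sketch of one way to pass that setting, assuming the job is launched with `spark-submit`; the helper `to_spark_submit_args` and the constant `PARQUET_LIST_CONF` are hypothetical names used only for illustration. The same key/value pair could instead be supplied through `SparkSession.builder.config(key, value)` when the session is created in code.

```python
from typing import Dict, List

# The config key comes from the comment above; the value is passed as a string.
PARQUET_LIST_CONF: Dict[str, str] = {
    "spark.hadoop.parquet.avro.write-old-list-structure": "false",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config dict into `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_spark_submit_args(PARQUET_LIST_CONF)
```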



Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on issue #10503:
URL: https://github.com/apache/hudi/issues/10503#issuecomment-1895010618

   What is the requested file `6548b5aa910845504c7cdea4_1705406501315.795.csv`? 
It does not belong to the Hoodie data format.



Re: [I] [SUPPORT] CoW: Hudi Upsert not working when there is a timestamp field in the composite key [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #10303:
URL: https://github.com/apache/hudi/issues/10303#issuecomment-1895009159

   @srinikandi Sorry for the delay on this. 
   
   I was able to reproduce the issue with Hudi versions 0.12.1 and 0.14.1. We 
have introduced the config 
"hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled"; 
you can set it to true.
   
   ```
   public static final ConfigProperty<String> KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED = ConfigProperty
       .key("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled")
       .defaultValue("false")
       .withDocumentation("When set to true, consistent value will be generated for a logical timestamp type column, "
           + "like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so "
           + "as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, "
           + "if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp "
           + "`2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `148302324000` in non row-writer path. "
           + "If enabled, then the timestamp value will be written in both the cases.");
   ```
   
   Reproducible code, which works when we set the config:
   
   ```
   from faker import Faker
   import pandas as pd
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import expr, lit

   # Assumptions: a Spark session with the Hudi bundle on the classpath, and a
   # placeholder table path. Neither was defined in the original snippet.
   spark = SparkSession.builder.getOrCreate()
   PATH = "file:///tmp/huditransaction"

   # Fake data generation
   fake = Faker()
   data = [{"transactionId": fake.uuid4(), "EventTime": "2014-01-01 23:00:01", "storeNbr": "1",
            "FullName": fake.name(), "Address": fake.address(),
            "CompanyName": fake.company(), "JobTitle": fake.job(),
            "EmailAddress": fake.email(), "PhoneNumber": fake.phone_number(),
            "RandomText": fake.sentence(), "City": fake.city(),
            "State": "NYC", "Country": "US"} for _ in range(5)]
   pandas_df = pd.DataFrame(data)

   hoodi_configs = {
       "hoodie.insert.shuffle.parallelism": "1",
       "hoodie.upsert.shuffle.parallelism": "1",
       "hoodie.bulkinsert.shuffle.parallelism": "1",
       "hoodie.delete.shuffle.parallelism": "1",
       "hoodie.datasource.write.row.writer.enable": "true",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "transactionId,storeNbr,EventTime",
       "hoodie.datasource.write.precombine.field": "Country",
       "hoodie.datasource.write.partitionpath.field": "State",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.combine.before.upsert": "true",
       "hoodie.table.name": "huditransaction",
       # Set to "true" to apply the fix suggested above.
       "hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled": "false",
   }
   spark.sparkContext.setLogLevel("WARN")

   # Cast the string column to a real timestamp so it becomes part of the record key.
   df = spark.createDataFrame(pandas_df).withColumn("EventTime", expr("cast(EventTime as timestamp)"))

   # Initial load with bulk_insert, then show the generated record keys.
   df.write.format("hudi").options(**hoodi_configs).option("hoodie.datasource.write.operation", "bulk_insert").mode("overwrite").save(PATH)
   spark.read.options(**hoodi_configs).format("hudi").load(PATH).select("_hoodie_record_key").show(10, False)

   # Upsert the same rows with a changed City, then show the record keys again.
   df.withColumn("City", lit("updated_city")).write.format("hudi").options(**hoodi_configs).option("hoodie.datasource.write.operation", "upsert").mode("append").save(PATH)
   spark.read.options(**hoodi_configs).format("hudi").load(PATH).select("_hoodie_record_key").show(10, False)
   ```
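
A simplified sketch of the mismatch described in the config documentation above. It is not Hudi's key generator: the two helper functions are hypothetical, and rendering the long as epoch milliseconds in UTC is an assumption made only for illustration. The point is that the same timestamp rendered as a formatted string on one write path and as a long on the other produces two different record keys, so an upsert presumably cannot match the row written earlier.

```python
from datetime import datetime

UNIX_EPOCH = datetime(1970, 1, 1)


def key_as_timestamp_string(ts: datetime) -> str:
    """Key part as a formatted timestamp, e.g. `2016-12-29 09:54:00.0`."""
    return ts.strftime("%Y-%m-%d %H:%M:%S") + ".0"


def key_as_epoch_long(ts: datetime) -> str:
    """Key part as a long (assumed here: milliseconds since the epoch, UTC)."""
    seconds = int((ts - UNIX_EPOCH).total_seconds())
    return str(seconds * 1000)


event_time = datetime(2016, 12, 29, 9, 54, 0)
row_writer_key = key_as_timestamp_string(event_time)
non_row_writer_key = key_as_epoch_long(event_time)

# Different strings for the same logical value: the keys do not match.
keys_match = row_writer_key == non_row_writer_key
```

With the consistent-logical-timestamp config enabled, both paths are documented to write the same representation, which removes this discrepancy.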
   
   Let me know in case you need any more help on this. Thanks.



(hudi) branch master updated: [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping (#10493)

2024-01-16 Thread stream2000

This is an automated email from the ASF dual-hosted git repository.

stream2000 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new eae5d4ae8e6 [HUDI-7291] Pushing Down Partition Pruning Conditions to 
Column Stats Earlier During Data Skipping (#10493)
eae5d4ae8e6 is described below

commit eae5d4ae8e62014191fac76bbbeae0939f11100b
Author: majian <47964462+majian1...@users.noreply.github.com>
AuthorDate: Wed Jan 17 14:17:29 2024 +0800

[HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats 
Earlier During Data Skipping (#10493)

* push down partition pruning filters when loading col stats index
---
 .../org/apache/hudi/ColumnStatsIndexSupport.scala  | 14 ++--
 .../scala/org/apache/hudi/HoodieFileIndex.scala| 37 ++
 2 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
index 9cdb15092b0..7a75c6c35ca 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala
@@ -26,6 +26,7 @@ import org.apache.hudi.avro.model._
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.common.config.HoodieMetadataConfig
 import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.function.SerializableFunction
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.common.table.HoodieTableMetaClient
 import org.apache.hudi.common.util.BinaryUtil.toBytes
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
*
* Please check out scala-doc of the [[transpose]] method explaining this view in more details
*/
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {
 cachedColumnStatsIndexViews.get(targetColumns) match {
   case Some(cachedDF) =>
 block(cachedDF)
 
   case None =>
-val colStatsRecords: HoodieData[HoodieMetadataColumnStats] =
+val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedFileNames.isEmpty) {
+  // NOTE: Because some tests directly check this method and don't get prunedPartitionsAndFileSlices, we need to make sure these tests are correct.
   loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory)
+} else {
+  val filterFunction = new SerializableFunction[HoodieMetadataColumnStats, java.lang.Boolean] {
+override def apply(r: HoodieMetadataColumnStats): java.lang.Boolean = {
+  prunedFileNames.contains(r.getFileName)
+}
+  }
+  loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory).filter(filterFunction)
+}
 
 withPersistedData(colStatsRecords, StorageLevel.MEMORY_ONLY) {
   val (transposedRows, indexSchema) = transpose(colStatsRecords, targetColumns)
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
index 709dfec183b..db8525be3d1 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
@@ -234,7 +234,7 @@ case class HoodieFileIndex(spark: SparkSession,
   //- Record-level Index is present
   //- List of predicates (filters) is present
   val candidateFilesNamesOpt: Option[Set[String]] =
-  lookupCandidateFilesInMetadataTable(dataFilters) match {
+  lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionsAndFileSlices) match {
 case Success(opt) => opt
 case Failure(e) =>
   logError("Failed to lookup candidate files in File Index", e)
@@ -316,11 +316,6 @@ case class HoodieFileIndex(spark: SparkSession,
 })
   }
 
-  private def lookupFileNamesMissingFromIndex(allIndexedFileNames: Set[String]) = {
-val allFileNames = getAllFiles().map(f => f.getPath.getName).toSet
-allFileNames -- allIndexedFileNames
-  }
-
   /**
* Computes pruned list of candidate base-files' names based on provided list of {@link dataFilters}
* conditions, by leveraging Metadata Table's Record Level Index and Column Statistics index (hereon referred as
@@ -333,7 +328,7 @@ case class HoodieFileIndex(spark: SparkSession,
* @param que
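
The idea behind the change to `loadTransposed` can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Hudi implementation: `ColumnStat` is a hypothetical stand-in for `HoodieMetadataColumnStats`, and `load_column_stats` mimics only the branch on `prunedFileNames` shown in the diff. An empty set means "no pruning information", so every record is kept; otherwise only records for files that survived partition pruning are kept.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List


@dataclass(frozen=True)
class ColumnStat:
    """Hypothetical stand-in for one column-stats index record."""
    file_name: str
    column_name: str
    min_value: int
    max_value: int


def load_column_stats(records: Iterable[ColumnStat],
                      pruned_file_names: AbstractSet[str] = frozenset()) -> List[ColumnStat]:
    """Keep only stats for files that survived partition pruning.

    An empty `pruned_file_names` keeps every record, mirroring the default
    `Set.empty` argument in the diff.
    """
    if not pruned_file_names:
        return list(records)
    return [r for r in records if r.file_name in pruned_file_names]


all_stats = [
    ColumnStat("part=a/f1.parquet", "price", 1, 5),
    ColumnStat("part=b/f2.parquet", "price", 6, 9),
]
# Only files from the partitions that remain after pruning are evaluated.
pruned_stats = load_column_stats(all_stats, {"part=b/f2.parquet"})
```

Filtering before the transpose step means the later data-skipping work only touches statistics for files that can still match the query.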

Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub



stream2000 merged PR #10493:
URL: https://github.com/apache/hudi/pull/10493



Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10515:
URL: https://github.com/apache/hudi/pull/10515#issuecomment-1895004422

   
   ## CI report:
   
   * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994)
 
   * 6d7150a24ab2169d780e5a98193144f5a16ad230 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21996)
 
   
   

Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10515:
URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894997004

   
   ## CI report:
   
   * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994)
 
   * 6d7150a24ab2169d780e5a98193144f5a16ad230 UNKNOWN
   
   

Re: [PR] [DOCS] Add parquet merge schema config [hudi]

2024-01-16 Thread via GitHub



yihua commented on code in PR #10463:
URL: https://github.com/apache/hudi/pull/10463#discussion_r1454666401


##
website/docs/configurations.md:
##
@@ -1792,6 +1792,16 @@ Configurations controlling the behavior of Kafka source in Hudi Streamer.
 | [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records.`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS``Since Version: 0.9.0` |
 ---
 
+ Parquet DFS Source Configs {#Parquet-DFS-Source-Configs}

Review Comment:
   Config page is automatically generated.  Just to double check, did you use 
the [tool](https://github.com/apache/hudi/tree/asf-site/hudi-utils) to generate 
these changes?




[jira] [Created] (HUDI-7303) Date field type unexpectedly convert to Long when using date comparison operator

2024-01-16 Thread Yao Zhang (Jira)

Yao Zhang created HUDI-7303:
---

 Summary: Date field type unexpectedly convert to Long when using 
date comparison operator
 Key: HUDI-7303
 URL: https://issues.apache.org/jira/browse/HUDI-7303
 Project: Apache Hudi
  Issue Type: Bug
  Components: flink
Affects Versions: 0.14.1, 0.14.0
 Environment: Flink 1.15.4 Hudi 0.14.0
Flink 1.17.1 Hudi 0.14.0
Flink 1.17.1 Hudi 0.14.1rc1
Reporter: Yao Zhang
Assignee: Yao Zhang


Given the table date_dim from TPCDS as an example:
{code:java}
CREATE TABLE date_dim (
  d_date_sk int,
  d_date_id varchar(16) NOT NULL,
  d_date date,
  d_month_seq int,
  d_week_seq int,
  d_quarter_seq int,
  d_year int,
  d_dow int,
  d_moy int,
  d_dom int,
  d_qoy int,
  d_fy_year int, 
  d_fy_quarter_seq int,
  d_fy_week_seq int,
  d_day_name varchar(9),
  d_quarter_name varchar(6),
  d_holiday char(1),
  d_weekend char(1),
  d_following_holiday char(1),
  d_first_dom int,
  d_last_dom int,
  d_same_day_ly int,
  d_same_day_lq int,
  d_current_day char(1),
  d_current_week char(1),
  d_current_month char(1),
  d_current_quarter char(1),
  d_current_year char(1)) with (
  'connector' = 'hudi',
  'path' = 'hdfs:///table_path/date_dim',
  'table.type' = 'COPY_ON_WRITE'); {code}

When you execute the following select statement, an exception will be thrown:

{code:java}
select * from date_dim where d_date between cast('1999-02-22' as date) and 
(cast('1999-02-22' as date) + INTERVAL '30' day);
{code}

The exception is:

{code:java}
java.lang.IllegalArgumentException: FilterPredicate column: d_date's declared 
type (java.lang.Long) does not match the schema found in file metadata. Column 
d_date is of type: INT32
Valid types for this column are: [class java.lang.Integer]
at 
org.apache.parquet.filter2.predicate.ValidTypeMap.assertTypeValid(ValidTypeMap.java:125)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:179)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:113)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.Operators$GtEq.accept(Operators.java:246) 
~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:119)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:306) 
~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95) 
~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45) 
~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:67)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:142)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:153)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.hudi.table.format.RecordIterators.getParquetRecordIterator(RecordIterators.java:78)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:130)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.hudi.table.format.cow.CopyOnWriteInputFormat.open(CopyOnWriteInputFormat.java:66)
 ~[hudi-flink1.17-bundle-0.14.0.jar:0.14.0]
at 
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84)
 ~[flink-dist-1.17.1.jar:1.17.1]
a

(hudi) branch master updated (108a885b4db -> d899fba9c71)

2024-01-16 Thread yihua

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 108a885b4db [HUDI-7294] TVF to query hudi metadata (#10491)
 add d899fba9c71 Revert "[MINOR] Handle parsing of all zero timestamps with 
MDT suffixes." (#10514)

No new revisions were added by this update.

Summary of changes:
 .../common/table/timeline/HoodieInstantTimeGenerator.java   |  4 
 .../common/table/timeline/TestHoodieActiveTimeline.java | 13 -
 2 files changed, 17 deletions(-)

Re: [PR] [MINOR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]

2024-01-16 Thread via GitHub



yihua merged PR #10514:
URL: https://github.com/apache/hudi/pull/10514



Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894988743

   
   ## CI report:
   
   * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN
   * d4c05ddde2295cf97a5b40edc3a7d62deca5a326 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21993)
 
   
   

Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]

2024-01-16 Thread via GitHub



gsudhanshu commented on issue #10503:
URL: https://github.com/apache/hudi/issues/10503#issuecomment-1894985425

   Yes, I am using PySpark 3.4.2.
   
   Complete error log:
   
   ```
   An error occurred while calling o208.load.
   : java.io.FileNotFoundException: File /var/www/maustats/primaryData/CD/6548b5aa910845504c7cdea4_1705406501315.795.csv does not exist
       at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
       at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
       at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
       at org.apache.hudi.common.util.TablePathUtils.getTablePath(TablePathUtils.java:58)
       at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:79)
       at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:111)
       at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:74)
       at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
       at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
       at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
       at scala.Option.getOrElse(Option.scala:189)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.base/java.lang.reflect.Method.invoke(Method.java:566)
       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
       at py4j.Gateway.invoke(Gateway.java:282)
       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
       at py4j.commands.CallCommand.execute(CallCommand.java:79)
       at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
       at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
       at java.base/java.lang.Thread.run(Thread.java:829)
   ```
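
The trace fails inside `TablePathUtils.getTablePath`, that is, while resolving the table location and before any data is read: the path handed to `load` does not exist on the local filesystem. A minimal sketch of a pre-flight check follows. It assumes the table sits on a local filesystem and that a Hudi base path is a directory containing a `.hoodie` metadata folder; the helper name `describe_hudi_path` is hypothetical and is not part of Hudi or PySpark.

```python
import os


def describe_hudi_path(path: str) -> str:
    """Classify a local path before passing it to spark.read.format("hudi").load().

    Hypothetical helper for illustration only; the returned labels are made up.
    """
    if not os.path.exists(path):
        # This is the situation reported in the trace: nothing exists at the path.
        return "missing"
    if os.path.isfile(path):
        # A single file (for example a CSV) is not a table base path.
        return "file"
    if os.path.isdir(os.path.join(path, ".hoodie")):
        # Assumption: a Hudi table base path contains a ".hoodie" folder.
        return "hudi-table"
    return "directory-without-hoodie"
```

Only a result of `"hudi-table"` would suggest the path is worth passing to the Hudi reader; any other result points at the input path rather than at Hudi itself.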



Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-16 Thread via GitHub



maheshguptags commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894959705

   @danny0405 can you please share the config to deduct the filegroup 
per-commit?
   




Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10515:
URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894955959

   
   ## CI report:
   
   * b4df6b857e79dfb636e3af695d305e8ea50077cc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21994)
 
   
   

Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894955106

   
   ## CI report:
   
   * 0c7300dbe529e40a4ce261032787843e241f2b45 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21966)
 
   * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN
   * d4c05ddde2295cf97a5b40edc3a7d62deca5a326 UNKNOWN
   
   

Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10515:
URL: https://github.com/apache/hudi/pull/10515#issuecomment-1894950233

   
   ## CI report:
   
   * b4df6b857e79dfb636e3af695d305e8ea50077cc UNKNOWN
   
   

Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1894949364

   
   ## CI report:
   
   * 0c7300dbe529e40a4ce261032787843e241f2b45 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21966)
 
   * cefd96781f2f87b7af3a92e5c6334724f7aeb400 UNKNOWN
   
   

Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10497:
URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894943957

   
   ## CI report:
   
   * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN
   * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21988)
 
   
   

[jira] [Updated] (HUDI-7302) Consistent Hashing row writer support sorting

2024-01-16 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7302:
-
Labels: pull-request-available  (was: )

> Consistent Hashing row writer support sorting
> -
>
> Key: HUDI-7302
> URL: https://issues.apache.org/jira/browse/HUDI-7302
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Qijun Fu
>Priority: Major
>  Labels: pull-request-available
>
> Consistent Hashing row writer support sorting



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

2024-01-16 Thread via GitHub



stream2000 opened a new pull request, #10515:
URL: https://github.com/apache/hudi/pull/10515

   ### Change Logs
   
   The consistent hashing row writer now supports sorting.
   
   ### Impact
   
   Consistent hashing clustering now supports sorting.
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   NONE
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



[jira] [Created] (HUDI-7302) Consistent Hashing row writer support sorting

2024-01-16 Thread Qijun Fu (Jira)

Qijun Fu created HUDI-7302:
--

 Summary: Consistent Hashing row writer support sorting
 Key: HUDI-7302
 URL: https://issues.apache.org/jira/browse/HUDI-7302
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Qijun Fu


Consistent Hashing row writer support sorting




Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]

2024-01-16 Thread via GitHub



codope closed issue #9826: [SUPPORT] Spark job stuck after completion, due to 
some non daemon threads still running
URL: https://github.com/apache/hudi/issues/9826



Re: [I] [SUPPORT] Spark job stuck after completion, due to some non daemon threads still running [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #9826:
URL: https://github.com/apache/hudi/issues/9826#issuecomment-1894915016

   Closing this issue as 0.14.1 is released. Please reopen if you see 
this issue again, @zyclove.



Re: [I] [SUPPORT] Solution for synchronizing the entire database table in flink [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #9965:
URL: https://github.com/apache/hudi/issues/9965#issuecomment-1894912026

   @bajiaolong Closing this out. Please reopen or create a new issue for further 
queries. Thanks.



Re: [I] UPSERTs are taking time [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1894910852

   @darlatrade Did the suggestion work? Do you need any other help here?



Re: [I] [SUPPORT]Flink writes MOR table, both RO table and RT table read nothing by hive [hudi]

2024-01-16 Thread via GitHub



ad1happy2go commented on issue #10465:
URL: https://github.com/apache/hudi/issues/10465#issuecomment-1894908381

   @nicholasxu They are deleted as part of the cleaning process. We do need them 
for point-in-time queries.



Re: [PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10514:
URL: https://github.com/apache/hudi/pull/10514#issuecomment-1894906031

   
   ## CI report:
   
   * fb3087b8709a75b658f802b5c1d5fbcc7cfbbd65 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21992)
 
   
   

Re: [PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10514:
URL: https://github.com/apache/hudi/pull/10514#issuecomment-1894900395

   
   ## CI report:
   
   * fb3087b8709a75b658f802b5c1d5fbcc7cfbbd65 UNKNOWN
   
   

[PR] Revert "[MINOR] Handle parsing of all zero timestamps with MDT suffixes." [hudi]

2024-01-16 Thread via GitHub



linliu-code opened a new pull request, #10514:
URL: https://github.com/apache/hudi/pull/10514

   Reverts apache/hudi#10481



Re: [PR] [MINOR] Handle parsing of all zero timestamps with MDT suffixes. [hudi]

2024-01-16 Thread via GitHub



linliu-code commented on PR #10481:
URL: https://github.com/apache/hudi/pull/10481#issuecomment-1894889506

   @prashantwason, the test failure caused by this change keeps failing the 
master branch. Please revert this PR and fix it before resubmitting it.



[jira] [Assigned] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF

2024-01-16 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat reassigned HUDI-7301:
-

Assignee: Vinaykumar Bhat

> Update hudi docs/websites with documentation for the new spark TVF
> --
>
> Key: HUDI-7301
> URL: https://issues.apache.org/jira/browse/HUDI-7301
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>
> Hudi documentation and website need to be updated to reflect the support for 
> new spark-sql related table-valued functions




[jira] [Created] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF

2024-01-16 Thread Vinaykumar Bhat (Jira)

Vinaykumar Bhat created HUDI-7301:
-

 Summary: Update hudi docs/websites with documentation for the new 
spark TVF
 Key: HUDI-7301
 URL: https://issues.apache.org/jira/browse/HUDI-7301
 Project: Apache Hudi
  Issue Type: Task
Reporter: Vinaykumar Bhat


Hudi documentation and website need to be updated to reflect the support for 
new spark-sql related table-valued functions




Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on PR #10389:
URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894865885

   Looks good to me, just take care of the test failures.



Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10389:
URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894861662

   
   ## CI report:
   
   * 248df7c04d611c5f521f309732aa21351161fa8b UNKNOWN
   * 0bd0b5188c73636a79d9d2b43a452497afa137f7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21989)
 
   
   

Re: [PR] [HUDI-7270] Support schema evolution by Flink SQL using HoodieCatalog [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on code in PR #10494:
URL: https://github.com/apache/hudi/pull/10494#discussion_r1454426428


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalogUtil.java:
##
@@ -172,4 +196,94 @@ public static List getOrderedPartitionValues(
 
 return values;
   }
+
+  protected static void alterTable(

Review Comment:
   Can we add some documentation to this method?




Re: [PR] [HUDI-7270] Support schema evolution by Flink SQL using HoodieCatalog [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on code in PR #10494:
URL: https://github.com/apache/hudi/pull/10494#discussion_r1454427215


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalogUtil.java:
##
@@ -172,4 +196,94 @@ public static List getOrderedPartitionValues(
 
 return values;
   }
+
+  protected static void alterTable(
+  AbstractCatalog catalog,
+  ObjectPath tablePath,
+  CatalogBaseTable newCatalogTable,
+  List tableChanges,
+  boolean ignoreIfNotExists,
+  org.apache.hadoop.conf.Configuration hadoopConf,
+  BiFunction inferTablePathFunc,
+  BiConsumer postAlterTableFunc) throws 
TableNotExistException, CatalogException {
+checkNotNull(tablePath, "Table path cannot be null");
+checkNotNull(newCatalogTable, "New catalog table cannot be null");
+
+if (!isUpdatePermissible(catalog, tablePath, newCatalogTable, 
ignoreIfNotExists)) {
+  return;
+}
+if (!tableChanges.isEmpty()) {
+  CatalogBaseTable oldTable = catalog.getTable(tablePath);
+  HoodieFlinkWriteClient writeClient = createWriteClient(tablePath, 
oldTable, hadoopConf, inferTablePathFunc);
+  Pair pair = 
writeClient.getInternalSchemaAndMetaClient();
+  InternalSchema oldSchema = pair.getLeft();
+  Function convertFunc = (LogicalType logicalType) -> 
AvroInternalSchemaConverter.convertToField(AvroSchemaConverter.convertToSchema(logicalType));
+  InternalSchema newSchema = Utils.applyTableChange(oldSchema, 
tableChanges, convertFunc);
+  if (!oldSchema.equals(newSchema)) {
+writeClient.setOperationType(WriteOperationType.ALTER_SCHEMA);
+writeClient.commitTableChange(newSchema, pair.getRight());
+  }
+}
+postAlterTableFunc.accept(tablePath, newCatalogTable);
+  }
+
+  protected static HoodieFlinkWriteClient createWriteClient(
+  ObjectPath tablePath,
+  CatalogBaseTable table,
+  org.apache.hadoop.conf.Configuration hadoopConf,
+  BiFunction inferTablePathFunc) {
+Map options = table.getOptions();
+String tablePathStr = inferTablePathFunc.apply(tablePath, table);
+return createWriteClient(options, tablePathStr, tablePath, hadoopConf);
+  }
+
+  protected static HoodieFlinkWriteClient createWriteClient(
+  Map options,
+  String tablePathStr,
+  ObjectPath tablePath,
+  org.apache.hadoop.conf.Configuration hadoopConf) {
+// enable auto-commit though ~
+options.put(HoodieWriteConfig.AUTO_COMMIT_ENABLE.key(), "true");

Review Comment:
   Not sure whether this is needed for all the scenarios?




Re: [PR] [HUDI-7246] Fix Data Skipping Issue: No Results When Query Conditions Involve Both Columns with and without Column Stats [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10389:
URL: https://github.com/apache/hudi/pull/10389#issuecomment-1894856206

   
   ## CI report:
   
   * 248df7c04d611c5f521f309732aa21351161fa8b UNKNOWN
   * 9aa9291d5b52c9801420505a91e60c92bf8439a2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21807)
 
   * 0bd0b5188c73636a79d9d2b43a452497afa137f7 UNKNOWN
   
   

Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894856440

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   
   

Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10497:
URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894856364

   
   ## CI report:
   
   * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963)
 
   * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN
   * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21988)
 
   
   

[jira] [Closed] (HUDI-7294) Add TVF to query hudi metadata

2024-01-16 Thread Sagar Sumit (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7294.
-
Fix Version/s: 1.0.0
   Resolution: Done

> Add TVF to query hudi metadata
> --
>
> Key: HUDI-7294
> URL: https://issues.apache.org/jira/browse/HUDI-7294
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Having a table valued function to query hudi metadata for a given table 
> through spark-sql will help in debugging
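
The kind of debugging query this function enables can be sketched as below. The statement shape (`select ... from hudi_metadata('<table>') where type=<n>`) and the column names are taken from the test added in PR #10491; the helper `build_metadata_query` is hypothetical and only assembles the SQL string that would then be passed to `spark.sql`.

```python
from typing import Sequence


def build_metadata_query(table: str, columns: Sequence[str], record_type: int) -> str:
    """Assemble a spark-sql statement against the hudi_metadata table-valued function.

    Hypothetical helper for illustration; it builds text only and runs nothing.
    """
    if "'" in table:
        # Keep the quoted identifier well-formed; this is not a full SQL sanitizer.
        raise ValueError("table identifier must not contain single quotes")
    column_list = ", ".join(columns)
    return f"select {column_list} from hudi_metadata('{table}') where type={record_type}"


# Mirrors one of the queries in the test, with a made-up table name.
query = build_metadata_query("my_table", ["type", "key", "filesystemmetadata"], 2)
```

With a live session the string would be run as `spark.sql(query)`; that assumes a Spark 3.2+ build with a Hudi bundle that includes this function, as the version guard in the test suggests.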




(hudi) branch master updated: [HUDI-7294] TVF to query hudi metadata (#10491)

2024-01-16 Thread codope

This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 108a885b4db [HUDI-7294] TVF to query hudi metadata (#10491)
108a885b4db is described below

commit 108a885b4db62f08d30ede47805b8b44c35ab1e6
Author: bhat-vinay <152183592+bhat-vi...@users.noreply.github.com>
AuthorDate: Wed Jan 17 08:21:07 2024 +0530

[HUDI-7294] TVF to query hudi metadata (#10491)

Adds a TVF function to query hudi metadata through spark-sql. Since the 
metadata is already a MOR table, it simply creates a 'snapshot' on
a MOR relation. Could not find any way to format (or filter) the RDD 
generated by the MOR snapshot relation. Uploading the PR to get some feedback.

Co-authored-by: Vinaykumar Bhat 
---
 .../sql/hudi/TestHoodieTableValuedFunction.scala   | 68 ++
 .../logcal/HoodieMetadataTableValuedFunction.scala | 46 +++
 .../hudi/analysis/HoodieSpark32PlusAnalysis.scala  | 17 +-
 .../sql/hudi/analysis/TableValuedFunctions.scala   |  7 ++-
 4 files changed, 136 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala
index 867e83c301e..bdf512d3451 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala
@@ -21,6 +21,8 @@ import 
org.apache.hudi.DataSourceWriteOptions.SPARK_SQL_INSERT_INTO_OPERATION
 import org.apache.hudi.HoodieSparkUtils
 import org.apache.spark.sql.functions.{col, from_json}
 
+import scala.collection.Seq
+
 class TestHoodieTableValuedFunction extends HoodieSparkSqlTestBase {
 
   test(s"Test hudi_query Table-Valued Function") {
@@ -558,4 +560,70 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  withTempDir { tmp =>
+Seq("cow", "mor").foreach { tableType =>
+  val tableName = generateTableName
+  val identifier = tableName
+  spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  id int,
+   |  name string,
+   |  ts long,
+   |  price int
+   |) using hudi
+   |partitioned by (price)
+   |tblproperties (
+   |  type = '$tableType',
+   |  primaryKey = 'id',
+   |  preCombineField = 'ts',
+   |  hoodie.datasource.write.recordkey.field = 'id',
+   |  hoodie.metadata.record.index.enable = 'true',
+   |  hoodie.metadata.index.column.stats.enable = 'true',
+   |  hoodie.metadata.index.column.stats.column.list = 'price'
+   |)
+   |location '${tmp.getCanonicalPath}/$tableName'
+   |""".stripMargin
+  )
+
+  spark.sql(
+s"""
+   | insert into $tableName
+   | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 
3000, 30)
+   | """.stripMargin
+  )
+
+  val result2DF = spark.sql(
+s"select type, key, filesystemmetadata from 
hudi_metadata('$identifier') where type=1"
+  )
+  assert(result2DF.count() == 1)
+
+  val result3DF = spark.sql(
+s"select type, key, filesystemmetadata from 
hudi_metadata('$identifier') where type=2"
+  )
+  assert(result3DF.count() == 3)
+
+  val result4DF = spark.sql(
+s"select type, key, ColumnStatsMetadata from 
hudi_metadata('$identifier') where type=3"
+  )
+  assert(result4DF.count() == 3)
+
+  val result5DF = spark.sql(
+s"select type, key, recordIndexMetadata from 
hudi_metadata('$identifier') where type=5"
+  )
+  assert(result5DF.count() == 3)
+
+  val result6DF = spark.sql(
+s"select type, key, BloomFilterMetadata from 
hudi_metadata('$identifier') where BloomFilterMetadata is not null"
+  )
+  assert(result6DF.count() == 0)
+}
+  }
+}
+spark.sessionState.conf.unsetConf(SPARK_SQL_INSERT_INTO_OPERATION.key)
+  }
 }
diff --git 
a/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logcal/HoodieMetadataTableValuedFunction.scala
 
b/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logcal/HoodieMetadataTableVa

Re: [PR] [HUDI-7294] TVF to query hudi metadata [hudi]

2024-01-16 Thread via GitHub



codope merged PR #10491:
URL: https://github.com/apache/hudi/pull/10491



(hudi) branch asf-site updated: [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510)

2024-01-16 Thread danny0405

This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new a702ced7f0f [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510)
a702ced7f0f is described below

commit a702ced7f0f4e0e058ae0f0eaff28ec278f62fbf
Author: Dipankar Mazumdar <103004148+dipankarmazum...@users.noreply.github.com>
AuthorDate: Tue Jan 16 21:46:06 2024 -0500

[DOCS] Diagram Changes for Clustering, Rollbacks, Table Types (#10510)

* remaining diagrams

* fixed issue with rollbacks page

-

Co-authored-by: Dipankar Mazumdar 
---
 website/docs/clustering.md|   6 +++---
 website/docs/rollbacks.md |   4 ++--
 website/docs/table_types.md   |   4 ++--
 website/static/assets/images/COW_new.png  | Bin 0 -> 1034864 bytes
 website/static/assets/images/MOR_new.png  | Bin 0 -> 1342587 bytes
 .../assets/images/blog/clustering/clustering1_new.png | Bin 0 -> 1420549 bytes
 .../assets/images/blog/clustering/clustering2_new.png | Bin 0 -> 302821 bytes
 .../assets/images/blog/clustering/clustering_3.png| Bin 0 -> 513090 bytes
 .../assets/images/blog/rollbacks/Rollback_1.png   | Bin 0 -> 311672 bytes
 .../assets/images/blog/rollbacks/rollback2_new.png| Bin 0 -> 569899 bytes
 10 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 2feab1902ac..7749292b1cf 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -59,7 +59,7 @@ Clustering Service builds on Hudi’s MVCC based design to 
allow for writers to
 
 NOTE: Clustering can only be scheduled for tables / partitions not receiving 
any concurrent updates. In the future, concurrent updates use-case will be 
supported as well.
 
-![Clustering 
example](/assets/images/blog/clustering/example_perf_improvement.png)
+![Clustering example](/assets/images/blog/clustering/clustering1_new.png)
 _Figure: Illustrating query performance improvements by clustering_
 
 ## Clustering Usecases
@@ -71,7 +71,7 @@ such small files could lead to higher query latency. From our 
experience support
 few users who are using Hudi just for small file handling capabilities. So, 
you could employ clustering to batch a lot
 of such small files into larger ones.
 
-![Batching small files](/assets/images/clustering_small_files.gif)
+![Batching small files](/assets/images/blog/clustering/clustering2_new.png)
 
 ### Cluster by sort key
 
@@ -80,7 +80,7 @@ arrival time, while query predicates do not sit well with it. 
With clustering, y
 based on query predicates and so, your data skipping will be very efficient 
and your query can ignore scanning a lot of
 unnecessary data.
 
-![Batching small files](/assets/images/clustering_sort.gif)
+![Batching small files](/assets/images/blog/clustering/clustering_3.png)
 
 ## Clustering Strategies
 
diff --git a/website/docs/rollbacks.md b/website/docs/rollbacks.md
index 5a2ebf2a70b..c78b8f3b084 100644
--- a/website/docs/rollbacks.md
+++ b/website/docs/rollbacks.md
@@ -35,7 +35,7 @@ for any actions/commits that is not yet committed and that 
refers to partially f
 is triggered and all dirty data is cleaned up followed by cleaning up the 
commit instants from the timeline.
 
 
-![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/single_write_rollback.png)
+![An example illustration of single writer 
rollbacks](/assets/images/blog/rollbacks/Rollback_1.png)
 _Figure 1: single writer with eager rollbacks_
 
 
@@ -63,7 +63,7 @@ information whether the writer that started the commit of 
interest is still maki
 the commit, the heartbeat file is deleted. Or if the write failed midway, the 
last modification time of the heartbeat 
 file is no longer updated, so other writers can deduce the failed write after 
a period of time elapses.
 
-![An example illustration of multi writer 
rollbacks](/assets/images/blog/rollbacks/multi_writer_rollback.png)
+![An example illustration of multi writer 
rollbacks](/assets/images/blog/rollbacks/rollback2_new.png)
 _Figure 2: multi-writer with lazy cleaning of failed commits_
 
 ## Related Resources
diff --git a/website/docs/table_types.md b/website/docs/table_types.md
index 28814d239e8..e280909a9f3 100644
--- a/website/docs/table_types.md
+++ b/website/docs/table_types.md
@@ -69,7 +69,7 @@ Following illustrates how this works conceptually, when data 
written into copy-o
 
 
 
-
+
 
 
 
@@ -97,7 +97,7 @@ their columnar base file, to keep the query performance in 
check (larger delta l
 Following illustrates how the table works, and shows two types of queries - 
snapshot query and read optimized query.
 
 
-
+
 
 
 There are lot of interesting things happening in this example

Re: [PR] [DOCS] Diagram Changes for Clustering, Rollbacks, Table Types [hudi]

2024-01-16 Thread via GitHub



danny0405 merged PR #10510:
URL: https://github.com/apache/hudi/pull/10510



Re: [PR] [HUDI-7299] BucketIndex table should forbit append mode [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on code in PR #10505:
URL: https://github.com/apache/hudi/pull/10505#discussion_r1454399288


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##
@@ -111,7 +111,7 @@ public class Pipelines {
*/
   public static DataStreamSink bulkInsert(Configuration conf, RowType 
rowType, DataStream dataStream) {
 WriteOperatorFactory operatorFactory = 
BulkInsertWriteOperator.getFactory(conf, rowType);
-if (OptionsResolver.isBucketIndexType(conf)) {
+if (!OptionsResolver.isAppendMode(conf) && 
OptionsResolver.isBucketIndexType(conf)) {

Review Comment:
   In `HoodieTableSink`, append mode takes first priority, so an append-only table never puts the bucket index into effect.
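
The precedence described in this comment can be sketched as follows. This is a minimal illustration, not the code in `HoodieTableSink` or `Pipelines`: the class, enum, and method names are hypothetical, and the two boolean flags stand in for the results of `OptionsResolver.isAppendMode(conf)` and `OptionsResolver.isBucketIndexType(conf)`.

```java
public class WritePathSketch {
  public enum WritePath { APPEND, BUCKET_INDEX, DEFAULT }

  // Hypothetical helper: the flags stand in for
  // OptionsResolver.isAppendMode(conf) and OptionsResolver.isBucketIndexType(conf).
  public static WritePath choose(boolean appendMode, boolean bucketIndexType) {
    if (appendMode) {
      return WritePath.APPEND;        // checked first, so it wins over the bucket index
    }
    if (bucketIndexType) {
      return WritePath.BUCKET_INDEX;  // only reachable when not in append mode
    }
    return WritePath.DEFAULT;
  }

  public static void main(String[] args) {
    System.out.println(choose(true, true));
    System.out.println(choose(false, true));
    System.out.println(choose(false, false));
  }
}
```

Under this assumption, a table configured with both append mode and a bucket index resolves to the append path, which is what the `!OptionsResolver.isAppendMode(conf) &&` guard in the diff expresses for bulk insert.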




Re: [I] [SUPPORT] Migration partitionned table with complex key generator to 0.14.1 leads to duplicates when recordkey length =1 [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on issue #10508:
URL: https://github.com/apache/hudi/issues/10508#issuecomment-1894845482

   Yeah, this is a mistake. We should not have included this in the 0.14.1 release; it is intended for 1.0.0.



Re: [I] [Support] An error occurred while calling o1748.load.\n: java.io.FileNotFoundException [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on issue #10503:
URL: https://github.com/apache/hudi/issues/10503#issuecomment-1894844589

   Are you using PySpark? This looks like a bug.



(hudi) branch master updated: [MINOR] Fix eager rollback mdt ut (#10506)

2024-01-16 Thread danny0405

This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 163053408f2 [MINOR] Fix eager rollback mdt ut (#10506)
163053408f2 is described below

commit 163053408f258c16085ce6bc7c11eccd2319a491
Author: KnightChess <981159...@qq.com>
AuthorDate: Wed Jan 17 10:38:27 2024 +0800

[MINOR] Fix eager rollback mdt ut (#10506)

Signed-off-by: wulingqi <981159...@qq.com>
---
 .../java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
index 42242fdfa32..a44d98c4f8b 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
@@ -1534,8 +1534,8 @@ public class TestJavaHoodieBackedMetadata extends 
TestHoodieMetadataBase {
 
fileStatus.getPath().getName().equals(rollbackInstant.getFileName())).collect(Collectors.toList());
 
 // ensure commit3's delta commit in MDT has last mod time > the actual 
rollback for previous failed commit i.e. commit2.
-// if rollback wasn't eager, rollback's last mod time will be lower than 
the commit3'd delta commit last mod time.
-assertTrue(commit3Files.get(0).getModificationTime() > 
rollbackFiles.get(0).getModificationTime());
+// if rollback wasn't eager, rollback's last mod time will be not larger 
than the commit3'd delta commit last mod time.
+assertTrue(commit3Files.get(0).getModificationTime() >= 
rollbackFiles.get(0).getModificationTime());
 client.close();
   }
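
The commit does not state why the strict `>` comparison was unreliable. One plausible cause, assumed here purely for illustration, is that modification times are reported at a coarser granularity than the gap between the two writes, so both files can carry the same timestamp. The sketch below uses made-up millisecond values and a hypothetical one-second granularity.

```java
public class ModTimeGranularity {
  // Assumption: the file system reports modification times rounded down to whole seconds.
  public static long truncateToSeconds(long epochMillis) {
    return (epochMillis / 1000L) * 1000L;
  }

  public static void main(String[] args) {
    long rollbackMillis = 1705459107200L; // illustrative value, rollback written first
    long commit3Millis = 1705459107900L;  // illustrative value, 700 ms later
    long rollbackSeen = truncateToSeconds(rollbackMillis);
    long commit3Seen = truncateToSeconds(commit3Millis);
    System.out.println("strict check passes: " + (commit3Seen > rollbackSeen));
    System.out.println("non-strict check passes: " + (commit3Seen >= rollbackSeen));
  }
}
```

With equal observed timestamps only the `>=` form of the assertion holds, which matches the direction of the change in the diff.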

Re: [PR] [MINOR] fix eager rollback mdt ut [hudi]

2024-01-16 Thread via GitHub



danny0405 merged PR #10506:
URL: https://github.com/apache/hudi/pull/10506



Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]

2024-01-16 Thread via GitHub



linliu-code commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894842428

   @hudi-bot run azure
   



Re: [I] Partitioning data into two keys is taking more time (10x) than partitioning into one key. [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on issue #10456:
URL: https://github.com/apache/hudi/issues/10456#issuecomment-1894838453

   Yeah, try to reduce the number of file groups per commit, because each file group has an in-memory buffer that is held before flushing to disk.
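
As a rough illustration of why the file group count matters, the buffered memory grows linearly with the number of file groups written in one commit. The helper name and the 4 MiB buffer size below are assumptions made for the arithmetic only; they are not Hudi defaults.

```java
public class BufferEstimate {
  // Rough upper bound: one in-memory buffer per file group touched in a commit.
  public static long estimateBufferBytes(int fileGroupsPerCommit, long bufferBytesPerGroup) {
    return Math.multiplyExact((long) fileGroupsPerCommit, bufferBytesPerGroup);
  }

  public static void main(String[] args) {
    long perGroup = 4L * 1024 * 1024; // assumed 4 MiB per buffer, illustrative only
    System.out.println(estimateBufferBytes(10, perGroup));
    System.out.println(estimateBufferBytes(100, perGroup));
  }
}
```

Partitioning on a second key can multiply the number of partitions, and therefore file groups, touched per commit, so the same batch may need many more buffers.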



Re: [PR] [HUDI-7001] ComplexAvroKeyGenerator should represent single record key as the value string without composing the key field name [hudi]

2024-01-16 Thread via GitHub



danny0405 commented on code in PR #9936:
URL: https://github.com/apache/hudi/pull/9936#discussion_r1454379931


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/RowDataKeyGen.java:
##
@@ -99,7 +99,7 @@ protected RowDataKeyGen(
   this.recordKeyProjection = null;
 } else {
   this.recordKeyFields = recordKeys.get().split(",");
-  if (this.recordKeyFields.length == 1) {
+  if (this.recordKeyFields.length == 1  && this.partitionPathFields.length 
== 1) {

Review Comment:
   Are you using 0.14.1? 0.14.0 should not include this commit.




Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



paul8263 commented on code in PR #10497:
URL: https://github.com/apache/hudi/pull/10497#discussion_r1454374435


##
hudi-flink-datasource/hudi-flink1.14.x/src/main/java/org/apache/hudi/table/format/cow/ParquetSplitReaderUtil.java:
##
@@ -460,59 +460,59 @@ private static WritableColumnVector 
createWritableColumnVector(
   case BOOLEAN:
 checkArgument(
 typeName == PrimitiveType.PrimitiveTypeName.BOOLEAN,
-"Unexpected type: %s", typeName);
+"Unexpected type exception. Primitive type: %s. Field type: %s.", 
typeName, fieldType.getTypeRoot().name());

Review Comment:
   I extracted it as a static method, so the code for error message construction is not duplicated at every call site.
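
A minimal sketch of what such an extraction could look like. The class and method names are hypothetical and not taken from the PR; only the message format string comes from the diff above.

```java
public class TypeMismatchMessage {
  // Builds the message once so each `case` branch can reuse it.
  public static String unexpectedType(String primitiveTypeName, String fieldTypeRoot) {
    return String.format(
        "Unexpected type exception. Primitive type: %s. Field type: %s.",
        primitiveTypeName, fieldTypeRoot);
  }

  // Hypothetical stand-in for a checkArgument-style guard.
  public static void checkType(boolean matches, String primitiveTypeName, String fieldTypeRoot) {
    if (!matches) {
      throw new IllegalArgumentException(unexpectedType(primitiveTypeName, fieldTypeRoot));
    }
  }

  public static void main(String[] args) {
    System.out.println(unexpectedType("INT64", "BOOLEAN"));
  }
}
```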




Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



paul8263 commented on code in PR #10497:
URL: https://github.com/apache/hudi/pull/10497#discussion_r1454372097


##
hudi-flink-datasource/hudi-flink1.14.x/src/main/java/org/apache/hudi/table/format/cow/vector/reader/ParquetColumnarRowSplitReader.java:
##
@@ -218,11 +218,17 @@ private WritableColumnVector[] createWritableVectors() {
 List types = requestedSchema.getFields();
 List descriptors = requestedSchema.getColumns();
 for (int i = 0; i < requestedTypes.length; i++) {
-  columns[i] = createWritableColumnVector(
-  batchSize,
-  requestedTypes[i],
-  types.get(i),
-  descriptors);
+  String fieldName = requestedSchema.getFieldName(i);

Review Comment:
   Hi @danny0405,
   
   Correct. It should be moved to the catch block, as it is only needed if an exception is thrown.
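
The shape being agreed on here can be sketched as below. All names are hypothetical: `createVector` is a stand-in for `createWritableColumnVector`, and the wrapped message text is illustrative rather than the wording used in the PR.

```java
import java.util.Arrays;
import java.util.List;

public class LazyFieldNameSketch {
  // Hypothetical stand-in for createWritableColumnVector: fails on a negative "type".
  static String createVector(int type) {
    if (type < 0) {
      throw new IllegalArgumentException("Unexpected type: " + type);
    }
    return "vector-" + type;
  }

  public static String[] createAll(int[] types, List<String> fieldNames) {
    String[] columns = new String[types.length];
    for (int i = 0; i < types.length; i++) {
      try {
        columns[i] = createVector(types[i]);
      } catch (IllegalArgumentException e) {
        // The field name is looked up only on the failure path.
        String fieldName = fieldNames.get(i);
        throw new IllegalArgumentException(
            "Failed to create vector for field '" + fieldName + "': " + e.getMessage(), e);
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(createAll(new int[] {1, 2}, Arrays.asList("id", "name"))));
  }
}
```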




Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10497:
URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894823533

   
   ## CI report:
   
   * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963)
 
   * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN
   * dd7311e64b0747772b7d20ad232feb3a4be0bdd9 UNKNOWN
   
   

Re: [PR] [HUDI-7297] Fix ambiguous error message when field type defined in sc… [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10497:
URL: https://github.com/apache/hudi/pull/10497#issuecomment-1894817105

   
   ## CI report:
   
   * c586484b0f4587c465a469b3bdf9fbf0bef28666 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21963)
 
   * fda81dda555c15ab51ee817fda9977bdcde84356 UNKNOWN
   
   

Re: [PR] [HUDI-6902] Fix a unit test [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10513:
URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894809514

   
   ## CI report:
   
   * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21986)
 
   
   

Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



yihua merged PR #10199:
URL: https://github.com/apache/hudi/pull/10199



(hudi) branch master updated: [HUDI-7300] Merge schema in ParuqetDFSSource (#10199)

2024-01-16 Thread yihua

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new ca5d4685a00 [HUDI-7300] Merge schema in ParuqetDFSSource (#10199)
ca5d4685a00 is described below

commit ca5d4685a002a3b3da917f6b195e27dcb20d7316
Author: Rohit Mittapalli 
AuthorDate: Tue Jan 16 17:52:07 2024 -0800

[HUDI-7300] Merge schema in ParuqetDFSSource (#10199)
---
 .../utilities/config/ParquetDFSSourceConfig.java   | 49 ++
 .../hudi/utilities/sources/ParquetDFSSource.java   |  6 ++-
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java
new file mode 100644
index 000..b3bf5678baf
--- /dev/null
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty PARQUET_DFS_MERGE_SCHEMA = 
ConfigProperty
+  .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.merge_schema.enable")
+  .defaultValue(false)
+  .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + 
"source.parquet.dfs.merge_schema.enable")
+  .markAdvanced()
+  .sinceVersion("1.0.0")
+  .withDocumentation("Merge schema across parquet files within a single 
write");
+}
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java
index a56a878f1fe..a3ee555ec5a 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java
@@ -21,6 +21,7 @@ package org.apache.hudi.utilities.sources;
 import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.config.ParquetDFSSourceConfig;
 import org.apache.hudi.utilities.schema.SchemaProvider;
 import org.apache.hudi.utilities.sources.helpers.DFSPathSelector;
 
@@ -29,6 +30,8 @@ import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
 
+import static org.apache.hudi.common.util.ConfigUtils.getBooleanWithAltKeys;
+
 /**
  * DFS Source that reads parquet data.
  */
@@ -52,6 +55,7 @@ public class ParquetDFSSource extends RowSource {
   }
 
   private Dataset fromFiles(String pathStr) {
-return sparkSession.read().parquet(pathStr.split(","));
+boolean mergeSchemaOption = getBooleanWithAltKeys(this.props, 
ParquetDFSSourceConfig.PARQUET_DFS_MERGE_SCHEMA);
+return sparkSession.read().option("mergeSchema", 
mergeSchemaOption).parquet(pathStr.split(","));
   }
 }
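
The new option is read with a fallback to the older `DELTA_STREAMER_CONFIG_PREFIX` key. The sketch below shows that lookup order with plain `java.util.Properties`. It is a simplified stand-in, not Hudi's `getBooleanWithAltKeys`, and the literal prefix values `hoodie.streamer.` and `hoodie.deltastreamer.` are assumptions, since the diff shows only the constant names.

```java
import java.util.Properties;

public class AltKeyLookup {
  // Assumed expansions of STREAMER_CONFIG_PREFIX and DELTA_STREAMER_CONFIG_PREFIX.
  public static final String KEY =
      "hoodie.streamer.source.parquet.dfs.merge_schema.enable";
  public static final String ALT_KEY =
      "hoodie.deltastreamer.source.parquet.dfs.merge_schema.enable";

  // Primary key first, then the alternative key, then the default.
  public static boolean getBooleanWithAlt(
      Properties props, String key, String altKey, boolean defaultValue) {
    String value = props.getProperty(key);
    if (value == null) {
      value = props.getProperty(altKey);
    }
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(ALT_KEY, "true"); // only the legacy key is set
    System.out.println(getBooleanWithAlt(props, KEY, ALT_KEY, false));
  }
}
```

The resolved boolean is what the diff passes to Spark's `mergeSchema` read option, so leaving both keys unset keeps the previous behavior (default `false`).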

Re: [PR] [HUDI-6902] Fix a unit test [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10513:
URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894774632

   
   ## CI report:
   
   * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21986)
 
   
   

Re: [PR] [HUDI-6902] Fix a unit test [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10513:
URL: https://github.com/apache/hudi/pull/10513#issuecomment-1894767088

   
   ## CI report:
   
   * 1f2d9784509db1c4b370862a02e1c1ee2f6f3bea UNKNOWN
   
   

Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894767056

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   * a70247f32679a6441cea131e946acce6fd09523e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-6902] Run Azure tests on containers [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894760625

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   * 4d759f3b4d6629e738b9b1afe4157c514d6df182 UNKNOWN
   
   

[PR] [HUDI-6902] Fix a unit test [hudi]

2024-01-16 Thread via GitHub



linliu-code opened a new pull request, #10513:
URL: https://github.com/apache/hudi/pull/10513

   ### Change Logs
   
   As title.
   
   ### Impact
   
   None.
   
   ### Risk level (write none, low medium or high below)
   
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



Re: [PR] [HUDI-6902] Run Azure tests on different agents [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10512:
URL: https://github.com/apache/hudi/pull/10512#issuecomment-1894713292

   
   ## CI report:
   
   * 0e5a63db2337ae435f17eb956460e22caeea65b3 UNKNOWN
   
   

Re: [I] [SUPPORT] Hudi Delete Partition on AWS Glue [hudi]

2024-01-16 Thread via GitHub



soumilshah1995 commented on issue #8894:
URL: https://github.com/apache/hudi/issues/8894#issuecomment-1894700142

   Hey buddy,
   it depends on how you have partitioned your tables. If you have partitioned the tables with Hive style,
   state='Connecticut' should work.
   
   Let's connect on Slack for more details :D



[PR] [HUDI-6902] Run Azure tests on different agents [hudi]

2024-01-16 Thread via GitHub



linliu-code opened a new pull request, #10512:
URL: https://github.com/apache/hudi/pull/10512

   ### Change Logs
   
   Create an agent pool for each job.
   
   ### Impact
   
   Isolate each job.
   
   ### Risk level (write none, low medium or high below)
   
   None.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   



Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894665261

   
   ## CI report:
   
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878)
   * b8158aa597e89aae3e83bb650bd07847a3f28dd3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21983)
 
   
   

Re: [PR] [HUDI-7247]Spark truncate table supports concurrency [hudi]

2024-01-16 Thread via GitHub



bvaradar commented on code in PR #10390:
URL: https://github.com/apache/hudi/pull/10390#discussion_r1454176491


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/TruncateHoodieTableCommand.scala:
##
@@ -68,7 +71,12 @@ case class TruncateHoodieTableCommand(
   val targetPath = new Path(basePath)
   val engineContext = new 
HoodieSparkEngineContext(sparkSession.sparkContext)
   val fs = FSUtils.getFs(basePath, 
sparkSession.sparkContext.hadoopConfiguration)
+  val hoodieWriteConfig = 
HoodieWriteConfig.newBuilder().withPath(basePath).withProps(properties).withEngineType(EngineType.SPARK)
+.build()
+  val transactionManager = new TransactionManager(hoodieWriteConfig, fs)
+  
transactionManager.beginTransaction(org.apache.hudi.common.util.Option.empty(), 
org.apache.hudi.common.util.Option.empty())
   FSUtils.deleteDir(engineContext, fs, targetPath, 
sparkSession.sparkContext.defaultParallelism)
+  
transactionManager.endTransaction(org.apache.hudi.common.util.Option.empty())

Review Comment:
   +1 on using a replace commit. This will be truly revertible and aligns with 
other operations. @waywtdcc: Can you make this change?
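
Separately from the replace-commit suggestion, the hunk above calls `endTransaction` only after `FSUtils.deleteDir` returns, so an exception during the delete would skip it. A minimal sketch of the usual try/finally guard follows. It is an illustration only: `GuardedTruncateSketch` is a hypothetical class, and a plain `ReentrantLock` stands in for Hudi's `TransactionManager`, whose real behavior is not reproduced here.

```java
import java.util.concurrent.locks.ReentrantLock;

public class GuardedTruncateSketch {

  // Stand-in for the transaction manager; NOT Hudi's TransactionManager.
  private final ReentrantLock lock = new ReentrantLock();

  // Runs the delete action while holding the lock and always releases it.
  public void truncate(Runnable deleteAction) {
    lock.lock();              // plays the role of beginTransaction(...)
    try {
      deleteAction.run();     // plays the role of FSUtils.deleteDir(...)
    } finally {
      lock.unlock();          // plays the role of endTransaction(...)
    }
  }

  // True while some thread still holds the lock.
  public boolean isLocked() {
    return lock.isLocked();
  }
}
```

With this shape the lock is released even when the delete action throws, which is the property the begin/end pair in the hunk does not guarantee on its own.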




[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas

2024-01-16 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7300:
-
Labels: pull-request-available  (was: )

> Parquet DFS source should support merging schemas
> -
>
> Key: HUDI-7300
> URL: https://issues.apache.org/jira/browse/HUDI-7300
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rohit Mittapalli
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should surface the option to merge schemas across the Parquet files in a 
> single commit when using ParquetDFSSource.
>  
> When set to false, the schema is randomly picked from one Parquet file (current 
> behavior). When set to true, the schemas across a commit are merged.
>  
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Re: [PR] [HUDI-7300] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894656511

   
   ## CI report:
   
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878)
 
   * b8158aa597e89aae3e83bb650bd07847a3f28dd3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   



[jira] [Updated] (HUDI-7300) Parquet DFS source should support merging schemas

2024-01-16 Thread Rohit Mittapalli (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mittapalli updated HUDI-7300:
---
Status: In Progress  (was: Open)

> Parquet DFS source should support merging schemas
> -
>
> Key: HUDI-7300
> URL: https://issues.apache.org/jira/browse/HUDI-7300
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rohit Mittapalli
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We should surface the option to merge schemas across the Parquet files in a 
> single commit when using ParquetDFSSource.
>  
> When set to false, the schema is randomly picked from one Parquet file (current 
> behavior). When set to true, the schemas across a commit are merged.
>  
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging




[jira] [Created] (HUDI-7300) Parquet DFS source should support merging schemas

2024-01-16 Thread Rohit Mittapalli (Jira)

Rohit Mittapalli created HUDI-7300:
--

 Summary: Parquet DFS source should support merging schemas
 Key: HUDI-7300
 URL: https://issues.apache.org/jira/browse/HUDI-7300
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Rohit Mittapalli


We should surface the option to merge schemas across the Parquet files in a 
single commit when using ParquetDFSSource.

 

When set to false, the schema is randomly picked from one Parquet file (current 
behavior). When set to true, the schemas across a commit are merged.

 

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
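
The difference between the two settings can be sketched without Spark or Hudi. The example below is an illustration under stated assumptions, not the actual implementation: `SchemaMergeSketch`, `pickOne`, and `mergeAll` are hypothetical names, a schema is modelled as an ordered map from column name to a type string, and conflicting types are simply rejected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaMergeSketch {

  // Option off: take the schema of a single file and ignore the others.
  public static Map<String, String> pickOne(List<Map<String, String>> fileSchemas) {
    return new LinkedHashMap<>(fileSchemas.get(0));
  }

  // Option on: union of all columns seen in the commit, in first-seen order.
  public static Map<String, String> mergeAll(List<Map<String, String>> fileSchemas) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> schema : fileSchemas) {
      for (Map.Entry<String, String> column : schema.entrySet()) {
        String existing = merged.putIfAbsent(column.getKey(), column.getValue());
        if (existing != null && !existing.equals(column.getValue())) {
          // Simplification: this sketch does not attempt type promotion.
          throw new IllegalArgumentException("Conflicting types for column " + column.getKey());
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> older = new LinkedHashMap<>();
    older.put("id", "long");
    older.put("name", "string");
    Map<String, String> newer = new LinkedHashMap<>(older);
    newer.put("email", "string");

    List<Map<String, String>> commit = new ArrayList<>();
    commit.add(older);
    commit.add(newer);

    System.out.println(pickOne(commit).keySet());   // [id, name]
    System.out.println(mergeAll(commit).keySet());  // [id, name, email]
  }
}
```

In the sketch, a column that appears only in the second file is lost when a single file's schema is used and kept when the schemas are merged.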




Re: [PR] [MINOR] Handle parsing of all zero timestamps with MDT suffixes. [hudi]

2024-01-16 Thread via GitHub



bvaradar merged PR #10481:
URL: https://github.com/apache/hudi/pull/10481



(hudi) branch master updated: [MINOR] Handle parsing of all zero timestamps with MDT suffixes. (#10481)

2024-01-16 Thread vbalaji

This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 68e703e3a49 [MINOR] Handle parsing of all zero timestamps with MDT 
suffixes. (#10481)
68e703e3a49 is described below

commit 68e703e3a4987a1d9ec6e20fae0ad7436f77bd3c
Author: Prashant Wason 
AuthorDate: Tue Jan 16 14:49:57 2024 -0800

[MINOR] Handle parsing of all zero timestamps with MDT suffixes. (#10481)
---
 .../common/table/timeline/HoodieInstantTimeGenerator.java   |  4 
 .../common/table/timeline/TestHoodieActiveTimeline.java | 13 +
 2 files changed, 17 insertions(+)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
index 2e48e40820d..3fb9a0698b6 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java
@@ -90,6 +90,10 @@ public class HoodieInstantTimeGenerator {
   LocalDateTime dt = LocalDateTime.parse(timestampInMillis, 
MILLIS_INSTANT_TIME_FORMATTER);
   return Date.from(dt.atZone(ZoneId.systemDefault()).toInstant());
 } catch (DateTimeParseException e) {
+  // MDT uses timestamps which add suffixes to the instant time. Hence, we 
are checking for all timestamps that start with all zeros.
+  if (timestamp.startsWith(HoodieTimeline.INIT_INSTANT_TS)) {
+return new Date(0);
+  }
   throw new ParseException(e.getMessage(), e.getErrorIndex());
 }
   }
diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
index ce0b5dad335..847d7d9e7b9 100755
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/table/timeline/TestHoodieActiveTimeline.java
@@ -609,6 +609,19 @@ public class TestHoodieActiveTimeline extends 
HoodieCommonTestHarness {
 System.out.println(defaultSecsGranularityDate.getTime());
   }
 
+  @Test
+  public void testAllZeroTimestampParsing() throws ParseException {
+String allZeroTs = "00";
+Date allZeroDate = 
HoodieActiveTimeline.parseDateFromInstantTime(allZeroTs);
+assertEquals(allZeroDate, new Date(0), "Parsing of all zero timestamp 
should succeed");
+
+// MDT uses timestamps which add suffixes to the instant time. These 
should also be parsable for all zero case.
+for (int index = 0; index < 10; ++index) {
+  allZeroDate = HoodieActiveTimeline.parseDateFromInstantTime(allZeroTs + 
"00" + index);
+  assertEquals(allZeroDate, new Date(0), "Parsing of all zero timestamp 
should succeed");
+}
+  }
+
   @Test
   public void testMetadataCompactionInstantDateParsing() throws ParseException 
{
 // default second granularity instant ID
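
The fallback added in this commit can be tried out in isolation. The sketch below is a simplified stand-in, not the Hudi class: `InstantTimeParseSketch` is a hypothetical name, the value of `INIT_INSTANT_TS` is assumed to be 14 zeros, and only the millisecond-granularity format is handled.

```java
import java.text.ParseException;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Date;

public class InstantTimeParseSketch {

  // Assumed value of the all-zero initial instant (14 zeros).
  public static final String INIT_INSTANT_TS = "00000000000000";

  // yyyyMMddHHmmss followed by exactly three millisecond digits.
  private static final DateTimeFormatter MILLIS_FORMATTER = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  public static Date parse(String timestamp) throws ParseException {
    try {
      LocalDateTime dt = LocalDateTime.parse(timestamp, MILLIS_FORMATTER);
      return Date.from(dt.atZone(ZoneId.systemDefault()).toInstant());
    } catch (DateTimeParseException e) {
      // All-zero timestamps, with or without a suffix, are not valid dates,
      // so they are mapped to the epoch instead of failing.
      if (timestamp.startsWith(INIT_INSTANT_TS)) {
        return new Date(0);
      }
      throw new ParseException(e.getMessage(), e.getErrorIndex());
    }
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(parse(INIT_INSTANT_TS).getTime());          // 0
    System.out.println(parse(INIT_INSTANT_TS + "001").getTime());  // 0
  }
}
```

The prefix check sits inside the catch block, so well-formed timestamps never reach it and malformed non-zero timestamps still raise `ParseException`.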

Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454158271


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+.defaultValue(true)

Review Comment:
   Fine by me! I will set it to false by default then.




Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454154841


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+.defaultValue(true)

Review Comment:
   ![Screenshot 2024-01-16 at 4 38 21 
PM](https://github.com/apache/hudi/assets/2701446/9c6730f8-e9f1-41ab-988c-f6242ec8e523)
   
   I did a quick check on the doc, and it is false by default. Setting this to true will 
introduce behavior changes. We should keep it backward compatible (BWC) in pre-1.0 releases.




Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



yihua commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454147802


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")

Review Comment:
   Avoid camelCase in the config naming. Use `.enable_merge_schema` instead.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+.defaultValue(true)
+.withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + 
"source.parquet.dfs.mergeSchema")
+.markAdvanced()

Review Comment:
   Add `sinceVersion("1.0.0")`.
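
The lookup behavior implied by `key`, `withAlternatives`, and `defaultValue` can be sketched with plain `java.util.Properties`. This is an illustration only, not Hudi's `ConfigProperty` machinery: `MergeSchemaConfigSketch` is a hypothetical class, the two prefix values and the `enable_merge_schema` key suffix are assumptions, and the default is set to false as discussed in this thread.

```java
import java.util.Properties;

public class MergeSchemaConfigSketch {

  // Assumed prefix values; the real constants live in ConfigUtils.
  static final String STREAMER_CONFIG_PREFIX = "hoodie.streamer.";
  static final String DELTA_STREAMER_CONFIG_PREFIX = "hoodie.deltastreamer.";

  // Hypothetical key suffix in snake_case, following the review suggestion.
  public static final String KEY =
      STREAMER_CONFIG_PREFIX + "source.parquet.dfs.enable_merge_schema";
  public static final String ALTERNATIVE_KEY =
      DELTA_STREAMER_CONFIG_PREFIX + "source.parquet.dfs.enable_merge_schema";

  // False keeps existing pipelines on their current behavior.
  public static final boolean DEFAULT_VALUE = false;

  // Primary key wins, then the alternative key, then the default.
  public static boolean mergeSchemaEnabled(Properties props) {
    String value = props.getProperty(KEY);
    if (value == null) {
      value = props.getProperty(ALTERNATIVE_KEY);
    }
    return value == null ? DEFAULT_VALUE : Boolean.parseBoolean(value);
  }
}
```

A pipeline that sets neither key keeps the old single-file schema behavior, while one that still uses the older prefix is honored through the alternative key.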




Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



rohitmittapalli commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894619834

   > @rohitmittapalli can you also file a jira and update the title with the 
jira id pls?
   
   I requested a JIRA account; I am unable to file until that gets approved.



Re: [PR] [MINOR] Clean default Hadoop configuration values in tests [hudi]

2024-01-16 Thread via GitHub



vinothchandar merged PR #10495:
URL: https://github.com/apache/hudi/pull/10495



(hudi) branch master updated: [MINOR] Clean default Hadoop configuration values in tests (#10495)

2024-01-16 Thread vinoth

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 32ade368d89 [MINOR] Clean default Hadoop configuration values in tests 
(#10495)
32ade368d89 is described below

commit 32ade368d899ede5c8e7854863945864604b5692
Author: Lin Liu <141371752+linliu-c...@users.noreply.github.com>
AuthorDate: Tue Jan 16 14:24:23 2024 -0800

[MINOR] Clean default Hadoop configuration values in tests (#10495)

* [MINOR] Clean default Hadoop configurations for SparkContext

These default Hadoop configurations are not used in Hudi tests.

* Consolidating the code into a helper class

-

Co-authored-by: vinoth chandar 
---
 .../org/apache/hudi/testutils/HoodieClientTestUtils.java   | 14 ++
 .../hudi/testutils/HoodieSparkClientTestHarness.java   |  9 ++---
 .../hudi/testutils/SparkClientFunctionalTestHarness.java   |  1 +
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
index 991c615c35d..55619a2a24b 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java
@@ -53,6 +53,7 @@ import org.apache.hadoop.hbase.io.hfile.CacheConfig;
 import org.apache.hadoop.hbase.io.hfile.HFile;
 import org.apache.hadoop.hbase.io.hfile.HFileScanner;
 import org.apache.spark.SparkConf;
+import org.apache.spark.SparkContext;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -61,6 +62,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
+import java.lang.reflect.Field;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
@@ -125,6 +127,18 @@ public class HoodieClientTestUtils {
 return SparkRDDReadClient.addHoodieSupport(sparkConf);
   }
 
+  public static void overrideSparkHadoopConfiguration(SparkContext 
sparkContext) {
+try {
+  // Clean the default Hadoop configurations since in our Hudi tests they 
are not used.
+  Field hadoopConfigurationField = 
sparkContext.getClass().getDeclaredField("_hadoopConfiguration");
+  hadoopConfigurationField.setAccessible(true);
+  Configuration testHadoopConfig = new Configuration(false);
+  hadoopConfigurationField.set(sparkContext, testHadoopConfig);
+} catch (NoSuchFieldException | IllegalAccessException e) {
+  LOG.warn(e.getMessage());
+}
+  }
+
   private static HashMap getLatestFileIDsToFullPath(String 
basePath, HoodieTimeline commitTimeline,
 
List commitsToReturn) throws IOException {
 HashMap fileIdToFullPath = new HashMap<>();
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
index 2a83baa018c..59cfcb4bb6d 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
@@ -69,6 +69,8 @@ import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.LocalFileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.spark.SparkConf;
+import org.apache.spark.SparkContext;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.SQLContext;
@@ -191,11 +193,12 @@ public abstract class HoodieSparkClientTestHarness 
extends HoodieWriterClientTes
 }
 
 // Initialize a local spark env
-jsc = new 
JavaSparkContext(HoodieClientTestUtils.getSparkConfForTest(appName + "#" + 
testMethodName));
+SparkConf sc = HoodieClientTestUtils.getSparkConfForTest(appName + "#" + 
testMethodName);
+SparkContext sparkContext = new SparkContext(sc);
+HoodieClientTestUtils.overrideSparkHadoopConfiguration(sparkContext);
+jsc = new JavaSparkContext(sparkContext);
 jsc.setLogLevel("ERROR");
-
 hadoopConf = jsc.hadoopConfiguration();
-
 sparkSession = SparkSession.builder()
 .withExtensions(JFunction.toScala(sparkSessionExtensions -> {
   sparkSessionExtensionsInjector.ifPresent(injector -> 
injector.accept(sparkSessionExtensions));
diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
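
The reflection technique used by `overrideSparkHadoopConfiguration` in this commit can be shown without Spark on the classpath. The sketch below is a stand-alone illustration: `FieldOverrideSketch` and its nested `Context` class are hypothetical, and a `Map` stands in for Hadoop's `Configuration`.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class FieldOverrideSketch {

  // Hypothetical stand-in for a class that builds its defaults internally.
  public static class Context {
    private Map<String, String> _configuration = new HashMap<>();

    public Context() {
      _configuration.put("fs.defaultFS", "file:///");
    }

    public Map<String, String> configuration() {
      return _configuration;
    }
  }

  // Replaces the private field with an empty map, dropping all defaults.
  public static void overrideConfiguration(Context context) {
    try {
      Field field = Context.class.getDeclaredField("_configuration");
      field.setAccessible(true);
      field.set(context, new HashMap<String, String>());
    } catch (NoSuchFieldException | IllegalAccessException e) {
      // Unlike the commit, which logs a warning, this sketch fails fast.
      throw new IllegalStateException("Could not override configuration", e);
    }
  }
}
```

The sketch looks the field up on `Context.class` rather than on `getClass()`, because `getDeclaredField` does not search superclasses. Whether that matters for `SparkContext` depends on how the context is created, which this archive does not show.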

Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454133265


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+.defaultValue(true)

Review Comment:
   I've set the default to true as per @nsivabalan's request here: 
https://github.com/apache/hudi/pull/10199#discussion_r1408722685
   
   Essentially, the key difference is that the schema will be merged across all 
the Parquet files in the commit. In the past, the schema would be inherited from 
the first file in the commit. In my opinion, merging should be the default case. 




Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



xushiyan commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894614464

   @rohitmittapalli can you also file a Jira and update the title with the Jira 
ID, please?



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub



xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454129825


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static 
org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+groupName = ConfigGroups.Names.HUDI_STREAMER,
+subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+description = "Configurations controlling the behavior of Parquet DFS 
source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+.defaultValue(true)

Review Comment:
   Can you clarify the impact of setting this default to true on existing 
pipelines that use this DFS source? Should it be false by default to be 
compatible?




Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-16 Thread via GitHub



vinothchandar commented on code in PR #10492:
URL: https://github.com/apache/hudi/pull/10492#discussion_r1454069459


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionQuick.java:
##
@@ -59,25 +59,34 @@ public void teardown() throws Exception {
   }
 
   protected static Stream testArgs() {
+boolean fullTest = false;
 Stream.Builder b = Stream.builder();
-//only testing row-writer enabled for now
-for (Boolean rowWriterEnable : new Boolean[] {true}) {
-  for (Boolean nullForDeletedCols : new Boolean[] {false, true}) {
-for (Boolean useKafkaSource : new Boolean[] {false, true}) {
-  for (Boolean addFilegroups : new Boolean[] {false, true}) {
-for (Boolean multiLogFiles : new Boolean[] {false, true}) {
-  for (Boolean shouldCluster : new Boolean[] {false, true}) {
-for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
-  if (!multiLogFiles || tableType.equals("MERGE_ON_READ")) {
-b.add(Arguments.of(tableType, shouldCluster, false, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols));
+if (fullTest) {
+  //only testing row-writer enabled for now
+  for (Boolean rowWriterEnable : new Boolean[] {true}) {
+for (Boolean nullForDeletedCols : new Boolean[] {false, true}) {
+  for (Boolean useKafkaSource : new Boolean[] {false, true}) {
+for (Boolean addFilegroups : new Boolean[] {false, true}) {
+  for (Boolean multiLogFiles : new Boolean[] {false, true}) {
+for (Boolean shouldCluster : new Boolean[] {false, true}) {
+  for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
+if (!multiLogFiles || tableType.equals("MERGE_ON_READ")) {
+  b.add(Arguments.of(tableType, shouldCluster, false, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols));
+}
   }
 }
+b.add(Arguments.of("MERGE_ON_READ", false, true, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols));
   }
-  b.add(Arguments.of("MERGE_ON_READ", false, true, rowWriterEnable, addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols));
 }
   }
 }
   }
+} else {

Review Comment:
 ```
 String tableType = COW, MOR
 Boolean shouldCluster = true
 Boolean shouldCompact = true
 Boolean rowWriterEnable = true
 Boolean addFilegroups = true
 Boolean multiLogFiles = true
 Boolean useKafkaSource= false, true
 Boolean allowNullForDeletedCols=false,true
 ```
 
 I wonder if we could just do something like this. New file groups and 
multiple log files, alongside clustering and compaction, should be the more 
complex (superset) scenario, no?
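
The reduced matrix proposed above can be sketched in plain Java, without JUnit's `Arguments`. The class and method names are hypothetical, the argument order follows the `Arguments.of(...)` calls in the diff, and the third position is assumed to be `shouldCompact`. The sketch mirrors the comment literally, so it does not apply the diff's guard that restricts `multiLogFiles` to `MERGE_ON_READ`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the reviewer's proposed "quick" parameter set.
class ReducedMatrixSketch {
  static List<Object[]> reducedArgs() {
    List<Object[]> out = new ArrayList<>();
    // Fixed to the "superset" scenario suggested in the review comment.
    boolean shouldCluster = true;
    boolean shouldCompact = true; // assumed meaning of the third argument in the diff
    boolean rowWriterEnable = true;
    boolean addFilegroups = true;
    boolean multiLogFiles = true; // note: the diff only allows this for MERGE_ON_READ
    for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
      for (boolean useKafkaSource : new boolean[] {false, true}) {
        for (boolean nullForDeletedCols : new boolean[] {false, true}) {
          out.add(new Object[] {tableType, shouldCluster, shouldCompact, rowWriterEnable,
              addFilegroups, multiLogFiles, useKafkaSource, nullForDeletedCols});
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Three two-valued dimensions remain, so this prints 8.
    System.out.println(reducedArgs().size());
  }
}
```

Only table type, Kafka source, and null-for-deleted-columns vary, so the quick path would run 8 combinations.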
 
 



##
hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestHoodieDeltaStreamerSchemaEvolutionQuick.java:
##
@@ -97,19 +106,27 @@ protected static Stream testReorderedColumn() {
   }
 
   protected static Stream testParamsWithSchemaTransformer() {
+boolean fullTest = false;
 Stream.Builder b = Stream.builder();
-for (Boolean useTransformer : new Boolean[] {false, true}) {
-  for (Boolean setSchema : new Boolean[] {false, true}) {
-for (Boolean rowWriterEnable : new Boolean[] {true}) {
-  for (Boolean nullForDeletedCols : new Boolean[] {false, true}) {
-for (Boolean useKafkaSource : new Boolean[] {false, true}) {
-  for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
-b.add(Arguments.of(tableType, rowWriterEnable, useKafkaSource, nullForDeletedCols, useTransformer, setSchema));
+if (fullTest) {
+  for (Boolean useTransformer : new Boolean[] {false, true}) {
+for (Boolean setSchema : new Boolean[] {false, true}) {
+  for (Boolean rowWriterEnable : new Boolean[] {true}) {
+for (Boolean nullForDeletedCols : new Boolean[] {false, true}) {
+  for (Boolean useKafkaSource : new Boolean[] {false, true}) {
+for (String tableType : new String[] {"COPY_ON_WRITE", "MERGE_ON_READ"}) {
+  b.add(Arguments.of(tableType, rowWriterEnable, useKafkaSource, nullForDeletedCols, useTransformer, setSchema));
+}
   }
 }
   }
 }
   }
+} else

(hudi) branch master updated (744f2a1b6c0 -> df6e351f31c)

2024-01-16 Thread vinoth

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 744f2a1b6c0 [HUDI-7286] Flink get hudi index type ignore case sensitive (#10476)
 add df6e351f31c [HUDI-6092] Set the timeout for the forked JVM (#10496)

No new revisions were added by this update.

Summary of changes:
 pom.xml | 1 +
 1 file changed, 1 insertion(+)

Re: [PR] [HUDI-6092] Set the timeout for the forked JVM for tests [hudi]

2024-01-16 Thread via GitHub



vinothchandar merged PR #10496:
URL: https://github.com/apache/hudi/pull/10496



[I] [SUPPORT] MOR hudi 0.14, Bloom Filters are not being used on query time [hudi]

2024-01-16 Thread via GitHub



bk-mz opened a new issue, #10511:
URL: https://github.com/apache/hudi/issues/10511

   **Describe the problem you faced**
   
   We encountered an issue with a MOR table that utilizes metadata bloom filters 
and Parquet bloom filters, and has enabled statistics. When attempting to query 
data, the system does not seem to utilize these bloom filters effectively. 
Instead, all requests result in a full partition scan, regardless of the 
applied filters.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a MOR table and write data using both Parquet bloom filters and 
metadata bloom filters.
   2. Attempt to query the data by applying a filter to one of the columns that 
participate in bloom filtering. Ensure that the filter narrows down the dataset 
size, making the bloom filters more likely to be effective.
   3. Observe that the Spark SQL User Interface (UI) displays a full partition 
scan.
   4. Compare the query latency time for the column with bloom filters (BF) to 
the latency time for the column without bloom filters (non-BF).
   
   **Expected behavior**
   
   The expected behavior is that querying the column with bloom filters (BF) 
should be significantly more efficient than querying the column without bloom 
filters (non-BF).
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   * Spark version : 3.5.0 AWS EMR 7.0.0
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   Table write hudi params:
   
   ```properties
   hoodie.bloom.index.filter.type=DYNAMIC_V0
   hoodie.bloom.index.prune.by.ranges=false
   hoodie.bloom.index.use.metadata=true
   hoodie.clean.async=true
   hoodie.cleaner.policy.failed.writes=LAZY
   hoodie.compact.inline.max.delta.commits=5
   hoodie.datasource.hive_sync.database=db_name
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.mode=hms
   hoodie.datasource.hive_sync.partition_fields=year,month,day,hour
   hoodie.datasource.hive_sync.table=table_name
   hoodie.datasource.hive_sync.use_jdbc=false
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.write.partitionpath.field=year,month,day,hour
   hoodie.datasource.write.path=s3://s3_path/table
   hoodie.datasource.write.precombine.field=date_updated_epoch
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.streaming.checkpoint.identifier=main_writer
   hoodie.datasource.write.table.type=MERGE_ON_READ
   hoodie.enable.data.skipping=true
   hoodie.index.type=BLOOM
   hoodie.metadata.enable=true
   hoodie.metadata.index.async=true
   hoodie.metadata.index.bloom.filter.column.list=id,account_id
   hoodie.metadata.index.bloom.filter.enable=true
   hoodie.metadata.index.column.stats.column.list=id,account_id
   hoodie.metadata.index.column.stats.enable=true
   hoodie.metricscompaction.log.blocks.on=true
   hoodie.table.name=table_name
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.write.lock.dynamodb.partition_key=table_name
   hoodie.write.lock.dynamodb.region=us-east-1
   hoodie.write.lock.dynamodb.table=hudi-lock
   hoodie.write.lock.num_retries=30
   hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   hoodie.write.lock.wait_time_ms=3
   hoodie.write.lock.wait_time_ms_between_retry=1
   ```
   
   Hadoop parquet properties:
   
   ```properties
   parquet.avro.write-old-list-structure=false
   parquet.bloom.filter.enabled#account_id=true
   parquet.bloom.filter.enabled#id=true
   ```
   
   If I download the file from S3 and run the parquet CLI against it, it shows 
that the BF on the column is actually present and working:
   
   ```
   parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id1
   
   
   Row group 0:
   

   value account_id1 NOT exists.
   
   parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id2
   
   Row group 0:
   

   value account_id2 maybe exists.
   ```
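
The "NOT exists" versus "maybe exists" answers above reflect the general bloom filter contract: a negative is definite, while a positive may be a false positive. The following toy sketch illustrates that contract only. It is not Parquet's split-block implementation and not Hudi's metadata index; the class name, sizes, and hashing scheme are all illustrative assumptions.

```java
import java.util.BitSet;

// Toy bloom filter, for illustration only.
class BloomFilterSketch {
  private final BitSet bits;
  private final int size;
  private final int hashCount;

  BloomFilterSketch(int size, int hashCount) {
    this.bits = new BitSet(size);
    this.size = size;
    this.hashCount = hashCount;
  }

  // Derives the i-th bit position from two cheap hashes (double hashing).
  private int index(String value, int i) {
    int h1 = value.hashCode();
    int h2 = new StringBuilder(value).reverse().toString().hashCode() | 1;
    return Math.floorMod(h1 + i * h2, size);
  }

  void add(String value) {
    for (int i = 0; i < hashCount; i++) {
      bits.set(index(value, i));
    }
  }

  // false: the value was definitely never added; true: it may have been added.
  boolean mightContain(String value) {
    for (int i = 0; i < hashCount; i++) {
      if (!bits.get(index(value, i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
    filter.add("account_id2");
    for (String probe : new String[] {"account_id1", "account_id2"}) {
      System.out.println("value " + probe
          + (filter.mightContain(probe) ? " maybe exists." : " NOT exists."));
    }
  }
}
```

A reader can skip a file or row group only on a definite negative, which is why an equality filter on a bloom-filtered column would be expected to reduce the amount of data scanned.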
   
   Read part:
   
   ```
   $ spark-sql \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
   --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
   --jars=/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
   --conf spark.executor.cores=8 \
   --conf spark.executor.memory=27G \
   --conf spark.driver.cores=8 \
   --conf spark.driver.memory=27G
   ```
   
   spark-sql (default)> select count(1) as cnt from table_with_bfs where year = 2024 and month = 1 and day = 5 and account_id = 'id1';
   82
   Time taken: 34.962 seconds, Fetch

Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-16 Thread via GitHub



linliu-code commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1894464626

   @jonvex, when is "fullTest" set to "true"?


