Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888525702

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
 
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-7289) Fix parameters for Big Query Sync

2024-01-11 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805906#comment-17805906
 ] 

Bhavani Sudha commented on HUDI-7289:
-------------------------------------

Comments as reported by a user from the Hudi Slack:
 # Support for the MoR table type in HoodieMultiTableStreamer - this info is not 
added in the 
[doc|https://hudi.apache.org/docs/hoodie_streaming_ingestion#multitablestreamer] 
or in the 
[repo|https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieMultiTableStreamer.java].
 # The descriptions of the Schema Registry configurations are too vague. It 
would be good to add more details and examples, and to state which configs are 
mandatory and which are optional. 
[https://hudi.apache.org/docs/configurations#Hudi-Streamer-Schema-Provider-Configs-advanced-configs]
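For illustration only, the kind of concrete example being asked for might look 
like the sketch below. The property names follow the Hudi Streamer schema 
provider config prefix; exact key names vary by release (older releases use the 
`hoodie.deltastreamer.*` prefix) and which keys are mandatory depends on the 
schema provider chosen, so verify against the docs for your version.

```properties
# Hypothetical schema-registry sketch for HoodieStreamer; URLs and subjects
# are placeholders, and key names should be checked against your Hudi release.
hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/topic-value/versions/latest
# Optional: a separate registry subject for the target (write) schema
hoodie.streamer.schemaprovider.registry.targetUrl=http://localhost:8081/subjects/topic-target-value/versions/latest
```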

> Fix parameters for Big Query Sync
> ---------------------------------
>
> Key: HUDI-7289
> URL: https://issues.apache.org/jira/browse/HUDI-7289
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Aditya Goenka
>Priority: Minor
> Fix For: 1.1.0
>
>
> revisit Big Query Sync configs - [https://hudi.apache.org/docs/gcp_bigquery/]
>  
> From a user - 
> Info about {{hoodie.gcp.bigquery.sync.require_partition_filter}} config is 
> missing from [here|https://hudi.apache.org/docs/gcp_bigquery] which is part 
> of Hudi 0.14.1.
> Additionally, the info about {{hoodie.gcp.bigquery.sync.base_path}} is not 
> very clear, and even the example is hard to follow.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888518774

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
 
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


bhat-vinay commented on code in PR #10491:
URL: https://github.com/apache/hudi/pull/10491#discussion_r1449904273


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala:
##
@@ -558,4 +558,50 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+    if (HoodieSparkUtils.gteqSpark3_2) {
+      withTempDir { tmp =>
+        Seq("cow").foreach { tableType =>
+          val tableName = generateTableName
+          val identifier = tableName
+          spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
+          spark.sql(
+            s"""
+               |create table $tableName (
+               |  id int,
+               |  name string,
+               |  ts long,
+               |  price int
+               |) using hudi
+               |partitioned by (price)
+               |tblproperties (
+               |  type = '$tableType',
+               |  primaryKey = 'id',
+               |  preCombineField = 'ts',
+               |  hoodie.datasource.write.recordkey.field = 'id',
+               |  hoodie.metadata.record.index.enable = 'true',
+               |  hoodie.metadata.index.column.stats.enable = 'true',
+               |  hoodie.metadata.index.column.stats.column.list = 'price'
+               |)
+               |location '${tmp.getCanonicalPath}/$tableName'
+               |""".stripMargin
+          )
+
+          spark.sql(
+            s"""
+               | insert into $tableName
+               | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30)
+               | """.stripMargin
+          )
+
+          val result1DF = spark.sql(
+            s"select * from hudi_metadata('$identifier')"
+          )
+          result1DF.show(false)

Review Comment:
   Yes, filters can be specified. For example, `select key, filesystemmetadata 
from hudi_metadata('table-name') where filesystemMetadata is not null` gives 
this
   
   ```
   +------------------+---------------------------------------------------------------------------------------------+
   |key               |filesystemmetadata                                                                           |
   +------------------+---------------------------------------------------------------------------------------------+
   |__all_partitions__|{price=30 -> {0, false}, price=20 -> {0, false}, price=10 -> {0, false}}                     |
   |price=30          |{7d255a2f-185e-40c3-87a7-1ffec2513d33-0_0-34-80_20240112061638246.parquet -> {434873, false}}|
   |price=20          |{d55a27a9-df00-4aed-b0fa-a491cea86039-0_1-34-81_20240112061638246.parquet -> {434874, false}}|
   |price=10          |{c179d819-3d6d-4864-9099-c944f4e10265-0_2-34-82_20240112061638246.parquet -> {434874, false}}|
   +------------------+---------------------------------------------------------------------------------------------+
   ```






Re: [PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


codope commented on code in PR #10491:
URL: https://github.com/apache/hudi/pull/10491#discussion_r1449892857


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala:
##
@@ -558,4 +558,50 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+    if (HoodieSparkUtils.gteqSpark3_2) {
+      withTempDir { tmp =>
+        Seq("cow").foreach { tableType =>

Review Comment:
   Let's write a few tests with a MOR table and compaction/clustering after 2 
commits.



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala:
##
@@ -558,4 +558,50 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+    if (HoodieSparkUtils.gteqSpark3_2) {
+      withTempDir { tmp =>
+        Seq("cow").foreach { tableType =>
+          val tableName = generateTableName
+          val identifier = tableName
+          spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
+          spark.sql(
+            s"""
+               |create table $tableName (
+               |  id int,
+               |  name string,
+               |  ts long,
+               |  price int
+               |) using hudi
+               |partitioned by (price)
+               |tblproperties (
+               |  type = '$tableType',
+               |  primaryKey = 'id',
+               |  preCombineField = 'ts',
+               |  hoodie.datasource.write.recordkey.field = 'id',
+               |  hoodie.metadata.record.index.enable = 'true',
+               |  hoodie.metadata.index.column.stats.enable = 'true',
+               |  hoodie.metadata.index.column.stats.column.list = 'price'
+               |)
+               |location '${tmp.getCanonicalPath}/$tableName'
+               |""".stripMargin
+          )
+
+          spark.sql(
+            s"""
+               | insert into $tableName
+               | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 3000, 30)
+               | """.stripMargin
+          )
+
+          val result1DF = spark.sql(
+            s"select * from hudi_metadata('$identifier')"
+          )
+          result1DF.show(false)

Review Comment:
   Can I have filters in the query? Also, it would make more sense to show the 
actual metadata partition type instead of the ordinal.



##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logcal/HoodieMetadataTableValuedFunction.scala:
##
@@ -0,0 +1,30 @@
+package org.apache.spark.sql.catalyst.plans.logcal
+
+import org.apache.hudi.common.util.ValidationUtils.checkState
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
+import org.apache.spark.sql.catalyst.plans.logical.LeafNode
+
+object HoodieMetadataTableValuedFunction {
+
+  val FUNC_NAME = "hudi_metadata";
+
+  def parseOptions(exprs: Seq[Expression], funcName: String): (String, Map[String, String]) = {
+    val args = exprs.map(_.eval().toString)
+    if (args.size != 1) {
+      throw new AnalysisException(s"Expect arguments (table_name or table_path) for function `$funcName`")
+    }
+
+    val identifier = args.head
+
+    (identifier, Map("hoodie.datasource.query.type" -> "snapshot"))

Review Comment:
   Should be snapshot by default. Need to set incremental when users pass 
`as.of.instant` in the query.
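A minimal sketch of the resolution this comment suggests (hypothetical helper: 
the `QueryTypeResolver` name and the `as.of.instant` option key are assumptions 
taken from this thread, not the merged implementation):

```scala
// Hypothetical sketch: snapshot by default, incremental when the user
// passes as.of.instant. Only the hoodie.datasource.query.type key is
// taken from the quoted code; everything else is illustrative.
object QueryTypeResolver {
  val AsOfInstantKey = "as.of.instant" // assumed option key

  def resolve(userOptions: Map[String, String]): Map[String, String] = {
    val queryType =
      if (userOptions.contains(AsOfInstantKey)) "incremental" else "snapshot"
    userOptions + ("hoodie.datasource.query.type" -> queryType)
  }
}
```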






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888466041

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888390957

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888386370

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7295]solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1888386326

   
   ## CI report:
   
   * ff2507430e08bc31cc0efaddda85281baf0a6ef5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21930)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7286]flink get hudi index type ignore case sensitive. [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10476:
URL: https://github.com/apache/hudi/pull/10476#issuecomment-1888386279

   
   ## CI report:
   
   * 9b05b48912f52d3cd317c78be17b39af8e47225f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21916)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-1517] create marker file for every log file (#4913) [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10487:
URL: https://github.com/apache/hudi/pull/10487#issuecomment-1888381077

   
   ## CI report:
   
   * d2bf0fec9b20dbe8c77ce3b9bd297b03545a948c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7295]solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1888381102

   
   ## CI report:
   
   * ff2507430e08bc31cc0efaddda85281baf0a6ef5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21930)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7286]flink get hudi index type ignore case sensitive. [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10476:
URL: https://github.com/apache/hudi/pull/10476#issuecomment-1888381043

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch master updated (6819727b3be -> a148bd3e1f7)

2024-01-11 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 6819727b3be [HUDI-6902] Use mvnw command for hadoo-mr test (#10474)
 add a148bd3e1f7 [HUDI-6902] Give minimum memory for unit tests (#10469)

No new revisions were added by this update.

Summary of changes:
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [PR] [HUDI-6902] Set minimum memory for unit tests [hudi]

2024-01-11 Thread via GitHub


vinothchandar merged PR #10469:
URL: https://github.com/apache/hudi/pull/10469





(hudi) branch master updated: [HUDI-6902] Use mvnw command for hadoo-mr test (#10474)

2024-01-11 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6819727b3be [HUDI-6902] Use mvnw command for hadoo-mr test (#10474)
6819727b3be is described below

commit 6819727b3be8e6943af45e21eab9e93e139bbe06
Author: Lin Liu <141371752+linliu-c...@users.noreply.github.com>
AuthorDate: Thu Jan 11 19:23:44 2024 -0800

[HUDI-6902] Use mvnw command for hadoo-mr test (#10474)

The reason is to clean up any orphan resources.
---
 .github/workflows/bot.yml | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/bot.yml b/.github/workflows/bot.yml
index 67c7ac16eaa..a31c2e3ea35 100644
--- a/.github/workflows/bot.yml
+++ b/.github/workflows/bot.yml
@@ -141,20 +141,23 @@ jobs:
   distribution: 'adopt'
   architecture: x64
   cache: maven
+  - name: Generate Maven Wrapper
+run:
+  mvn -N io.takari:maven:wrapper
   - name: Build Project
 env:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
   SPARK_PROFILE: ${{ matrix.sparkProfile }}
   FLINK_PROFILE: ${{ matrix.flinkProfile }}
 run:
-  mvn clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -DskipTests=true -Phudi-platform-service $MVN_ARGS -am -pl 
hudi-hadoop-mr,hudi-client/hudi-java-client
+  ./mvnw clean install -T 2 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -DskipTests=true -Phudi-platform-service $MVN_ARGS -am -pl 
hudi-hadoop-mr,hudi-client/hudi-java-client
   - name: UT - hudi-hadoop-mr and hudi-client/hudi-java-client
 env:
   SCALA_PROFILE: ${{ matrix.scalaProfile }}
   SPARK_PROFILE: ${{ matrix.sparkProfile }}
   FLINK_PROFILE: ${{ matrix.flinkProfile }}
 run:
-  mvn test -Punit-tests -fae -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -pl hudi-hadoop-mr,hudi-client/hudi-java-client $MVN_ARGS
+  ./mvnw test -Punit-tests -fae -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-D"FLINK_PROFILE" -pl hudi-hadoop-mr,hudi-client/hudi-java-client $MVN_ARGS
 
   test-spark-java17:
 runs-on: ubuntu-latest



Re: [PR] [HUDI-6902] Use mvnw command for hadoop-mr test [hudi]

2024-01-11 Thread via GitHub


vinothchandar merged PR #10474:
URL: https://github.com/apache/hudi/pull/10474





Re: [PR] [HUDI-7286]flink get hudi index type ignore case sensitive. [hudi]

2024-01-11 Thread via GitHub


Akihito-Liang commented on PR #10476:
URL: https://github.com/apache/hudi/pull/10476#issuecomment-1888360910

   @hudi-bot run azure





Re: [PR] [HUDI-7295]solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


empcl commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1888355613

   @hudi-bot run azure





Re: [PR] [HUDI-7286]flink get hudi index type ignore case sensitive. [hudi]

2024-01-11 Thread via GitHub


Akihito-Liang commented on PR #10476:
URL: https://github.com/apache/hudi/pull/10476#issuecomment-1888356085

   @hudi-bot run azure
   





[PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


majian1998 opened a new pull request, #10493:
URL: https://github.com/apache/hudi/pull/10493

   In the current implementation of data skipping, the column statistics for the 
entire table are read, and the data-skipping filters are then evaluated against 
them. When the table holds a large volume of data spread over many partitions, 
this reduces the efficiency of data skipping, because the partition pruning 
conditions are never used.
   
   By pushing the partition-filtering conditions down so that they are applied 
right after the column statistics are read, the set of column stats that 
participates in the subsequent data skipping shrinks significantly. This saves 
time in the later computation and also conserves memory.
   
   In a test on a 25TB table spread across 60 subpartitions, a query against a 
single 1.4TB subpartition showed that data skipping saved several seconds 
overall. Whenever partition pruning applies, comparable time savings are 
achievable, along with a substantial reduction in the memory footprint of the 
candidate-file list needed for further computation.
   
   In scenarios where partition pruning does not apply, this change adds only a 
minimal cost. That minor overhead is inconsequential either way: when the data 
volume is large, a seconds-level overhead is negligible; when the data volume is 
small, partitioning is unnecessary in the first place and the extra filter 
operation is not time-consuming.
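The pushdown described above can be sketched as follows. This is illustrative 
only: the `ColumnStat` case class, its fields, and the predicate shapes are 
assumptions for the sketch, not Hudi's actual column-stats schema or the code in 
this PR.

```scala
// Illustrative sketch: apply partition pruning to the column-stats rows
// first, so that min/max data skipping runs over a much smaller set.
case class ColumnStat(partition: String, fileName: String, minValue: Int, maxValue: Int)

def candidateFiles(stats: Seq[ColumnStat],
                   partitionPredicate: String => Boolean,
                   queryValue: Int): Seq[String] = {
  stats
    .filter(s => partitionPredicate(s.partition))                       // partition pruning first
    .filter(s => s.minValue <= queryValue && queryValue <= s.maxValue)  // then min/max data skipping
    .map(_.fileName)
}
```

With an equality predicate on one partition, only that partition's stats are 
ever compared against the query value, which is the memory and time saving the 
PR description refers to.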
   
   ### Change Logs
   
   Pushing Down Partition Pruning Conditions to Column Stats During Data 
Skipping
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


majian1998 closed pull request #10485: [HUDI-7291] Pushing Down Partition 
Pruning Conditions to Column Stats Earlier During Data Skipping
URL: https://github.com/apache/hudi/pull/10485





Re: [PR] [HUDI-1517] create marker file for every log file (#4913) [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10487:
URL: https://github.com/apache/hudi/pull/10487#issuecomment-1888290112

   
   ## CI report:
   
   * d2bf0fec9b20dbe8c77ce3b9bd297b03545a948c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21941)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-1517] create marker file for every log file (#4913) [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10487:
URL: https://github.com/apache/hudi/pull/10487#issuecomment-1888283360

   
   ## CI report:
   
   * d2bf0fec9b20dbe8c77ce3b9bd297b03545a948c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1888274766

   
   ## CI report:
   
   * 1b5d4ba50a611488bdc533914c88475ced19fd99 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21938)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888272284

   
   ## CI report:
   
   * 3934af17773124b22860f1d82fe7bb69945e4a9e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21937)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7286]flink get hudi index type ignore case sensitive. [hudi]

2024-01-11 Thread via GitHub


danny0405 commented on PR #10476:
URL: https://github.com/apache/hudi/pull/10476#issuecomment-1888240211

   oops, there are some test failures in the CI.





Re: [I] [SUPPORT]Flink writes MOR table, both RO table and RT table read nothing by hive [hudi]

2024-01-11 Thread via GitHub


danny0405 commented on issue #10465:
URL: https://github.com/apache/hudi/issues/10465#issuecomment-1888239384

   Supported by default.





[jira] [Commented] (HUDI-7282) Hudi COW APPEND mode can be verified through cluster that even if the index is bucket

2024-01-11 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805848#comment-17805848
 ] 

Danny Chen commented on HUDI-7282:
----------------------------------

Fixed via master branch: 4c5280297964c5aa3bd4bd7abe893ac36b8ebbcf

> Hudi COW APPEND mode can be verified through cluster that even if the index 
> is bucket
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-7282
> URL: https://issues.apache.org/jira/browse/HUDI-7282
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Junning Liang
>Assignee: Junning Liang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> When using append mode with a clustering configuration on a COW table, if its 
> index.type is BUCKET, it throws an UnsupportedOperationException like:
>  
> {code:java}
> java.lang.UnsupportedOperationException: Clustering is not supported for 
> bucket index.
>   at 
> org.apache.hudi.util.ClusteringUtil.validateClusteringScheduling(ClusteringUtil.java:49)
>  
> ~[blob_p-9c5a14ae562d04e25991a03bf9668559004f6a49-984c72deff77d77c6bf8aeffe0ef8bd3:0.13.1-012]
>   at 
> org.apache.hudi.util.ClusteringUtil.scheduleClustering(ClusteringUtil.java:61)
>  
> ~[blob_p-9c5a14ae562d04e25991a03bf9668559004f6a49-984c72deff77d77c6bf8aeffe0ef8bd3:0.13.1-012]
>   at 
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:288)
>  
> ~[blob_p-9c5a14ae562d04e25991a03bf9668559004f6a49-984c72deff77d77c6bf8aeffe0ef8bd3:0.13.1-012]
>   at 
> org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
>  
> ~[blob_p-9c5a14ae562d04e25991a03bf9668559004f6a49-984c72deff77d77c6bf8aeffe0ef8bd3:0.13.1-012]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_312]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_312]
>   at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312] {code}
> But as we know, append mode is not related to the index type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
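The guard described in the stack trace above can be sketched as a small standalone program. This is a minimal sketch, not Hudi's actual implementation: the config map, its keys, and the `isAppendMode`/`isBucketIndexType` helpers are illustrative stand-ins for Hudi's `OptionsResolver` checks.

```java
import java.util.Map;

public class ClusteringGuardSketch {

  // Hypothetical stand-ins for OptionsResolver.isAppendMode(conf) and
  // OptionsResolver.isBucketIndexType(conf); the keys are illustrative only.
  static boolean isAppendMode(Map<String, String> conf) {
    return "insert".equalsIgnoreCase(conf.getOrDefault("write.operation", "upsert"));
  }

  static boolean isBucketIndexType(Map<String, String> conf) {
    return "BUCKET".equalsIgnoreCase(conf.getOrDefault("index.type", "FLINK_STATE"));
  }

  // With the fix, append-mode tables skip the bucket-index check entirely,
  // since clustering of append-only writes does not depend on the index type.
  static void validateClusteringScheduling(Map<String, String> conf) {
    if (!isAppendMode(conf) && isBucketIndexType(conf)) {
      throw new UnsupportedOperationException("Clustering is not supported for bucket index.");
    }
  }

  public static void main(String[] args) {
    // Append mode + bucket index: passes validation with the guard above.
    validateClusteringScheduling(Map.of("write.operation", "insert", "index.type", "BUCKET"));

    // Upsert mode + bucket index: still rejected.
    boolean rejected = false;
    try {
      validateClusteringScheduling(Map.of("write.operation", "upsert", "index.type", "BUCKET"));
    } catch (UnsupportedOperationException e) {
      rejected = true;
    }
    System.out.println(rejected ? "rejected" : "accepted");
  }
}
```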


[jira] [Closed] (HUDI-7282) Hudi COW APPEND mode can be verified through cluster that even if the index is bucket

2024-01-11 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7282.

Resolution: Fixed

> Hudi COW APPEND mode can be verified through cluster that even if the index 
> is bucket
> -
>
> Key: HUDI-7282
> URL: https://issues.apache.org/jira/browse/HUDI-7282
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Junning Liang
>Assignee: Junning Liang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>





[jira] [Updated] (HUDI-7282) Hudi COW APPEND mode can be verified through cluster that even if the index is bucket

2024-01-11 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7282:
-
Fix Version/s: 1.0.0

> Hudi COW APPEND mode can be verified through cluster that even if the index 
> is bucket
> -
>
> Key: HUDI-7282
> URL: https://issues.apache.org/jira/browse/HUDI-7282
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Junning Liang
>Assignee: Junning Liang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>





Re: [PR] [HUDI-7282] avoid verification failure due to append writing of the c… [hudi]

2024-01-11 Thread via GitHub


danny0405 commented on PR #10475:
URL: https://github.com/apache/hudi/pull/10475#issuecomment-1888236013

   Not sure why an append table could have a bucket index type, but at least it's 
proactive protection logic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7282] Avoid verification failure due to append writing of the cow table with cluster configuration when the index is bucket. (#10475)

2024-01-11 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4c528029796 [HUDI-7282] Avoid verification failure due to append 
writing of the cow table with cluster configuration when the index is bucket. 
(#10475)
4c528029796 is described below

commit 4c5280297964c5aa3bd4bd7abe893ac36b8ebbcf
Author: akido <37492907+akihito-li...@users.noreply.github.com>
AuthorDate: Fri Jan 12 09:11:30 2024 +0800

[HUDI-7282] Avoid verification failure due to append writing of the cow 
table with cluster configuration when the index is bucket. (#10475)
---
 .../src/main/java/org/apache/hudi/util/ClusteringUtil.java|  2 +-
 .../test/java/org/apache/hudi/utils/TestClusteringUtil.java   | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClusteringUtil.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClusteringUtil.java
index 75d4ea79815..ac81b4e7af4 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClusteringUtil.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/ClusteringUtil.java
@@ -49,7 +49,7 @@ public class ClusteringUtil {
   private static final Logger LOG = LoggerFactory.getLogger(ClusteringUtil.class);
 
   public static void validateClusteringScheduling(Configuration conf) {
-    if (OptionsResolver.isBucketIndexType(conf)) {
+    if (!OptionsResolver.isAppendMode(conf) && OptionsResolver.isBucketIndexType(conf)) {
       HoodieIndex.BucketIndexEngineType bucketIndexEngineType = OptionsResolver.getBucketEngineType(conf);
       switch (bucketIndexEngineType) {
         case SIMPLE:
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestClusteringUtil.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestClusteringUtil.java
index e9433d036ca..ca8718289d9 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestClusteringUtil.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestClusteringUtil.java
@@ -32,6 +32,7 @@ import org.apache.hudi.common.util.ClusteringUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.configuration.FlinkOptions;
 import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndex;
 import org.apache.hudi.table.HoodieFlinkTable;
 import org.apache.hudi.util.ClusteringUtil;
 import org.apache.hudi.util.FlinkTables;
@@ -113,6 +114,16 @@ public class TestClusteringUtil {
 
         .stream().map(HoodieInstant::getTimestamp).collect(Collectors.toList());
     assertThat(actualInstants, is(oriInstants));
   }
+
+  @Test
+  void validateClusteringScheduling() throws Exception {
+    beforeEach();
+    ClusteringUtil.validateClusteringScheduling(this.conf);
+
+    // validate bucket index
+    this.conf.setString(FlinkOptions.INDEX_TYPE, HoodieIndex.IndexType.BUCKET.name());
+    ClusteringUtil.validateClusteringScheduling(this.conf);
+  }
 
   /**
* Generates a clustering plan on the timeline and returns its instant time.



Re: [PR] [HUDI-7282] avoid verification failure due to append writing of the c… [hudi]

2024-01-11 Thread via GitHub


danny0405 merged PR #10475:
URL: https://github.com/apache/hudi/pull/10475





(hudi) branch master updated: [MINOR] Parallelized the check for existence of files in IncrementalRelation. (#10480)

2024-01-11 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d9b1bf9429d [MINOR] Parallelized the check for existence of files in 
IncrementalRelation. (#10480)
d9b1bf9429d is described below

commit d9b1bf9429d88d3d2989b0a5fc4efb39e0af7b6c
Author: Prashant Wason 
AuthorDate: Thu Jan 11 17:06:50 2024 -0800

[MINOR] Parallelized the check for existence of files in 
IncrementalRelation. (#10480)

    This speeds up the check for large datasets when a very large number of 
files needs to be checked for existence.
---
 .../src/main/scala/org/apache/hudi/IncrementalRelation.scala  | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala
index 227b585c9ef..6566c450477 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala
@@ -24,6 +24,7 @@ import org.apache.hudi.HoodieBaseRelation.isSchemaEvolutionEnabledOnRead
 import org.apache.hudi.HoodieSparkConfUtils.getHollowCommitHandling
 import org.apache.hudi.client.common.HoodieSparkEngineContext
 import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.config.SerializableConfiguration
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.{HoodieCommitMetadata, HoodieFileFormat, HoodieRecord, HoodieReplaceCommitMetadata}
 import org.apache.hudi.common.table.timeline.TimelineUtils.HollowCommitHandling.USE_TRANSITION_TIME
@@ -239,11 +240,17 @@ class IncrementalRelation(val sqlContext: SQLContext,
       var doFullTableScan = false
 
       if (fallbackToFullTableScan) {
-        val fs = basePath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration);
+        // val fs = basePath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration);
         val timer = HoodieTimer.start
 
         val allFilesToCheck = filteredMetaBootstrapFullPaths ++ filteredRegularFullPaths
-        val firstNotFoundPath = allFilesToCheck.find(path => !fs.exists(new Path(path)))
+        val serializedConf = new SerializableConfiguration(sqlContext.sparkContext.hadoopConfiguration)
+        val localBasePathStr = basePath.toString
+        val firstNotFoundPath = sqlContext.sparkContext.parallelize(allFilesToCheck.toSeq, allFilesToCheck.size)
+          .map(path => {
+            val fs = new Path(localBasePathStr).getFileSystem(serializedConf.get)
+            fs.exists(new Path(path))
+          }).collect().find(v => !v)
         val timeTaken = timer.endTimer()
         log.info("Checking if paths exists took " + timeTaken + "ms")
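
The idea in this patch — probing each candidate path in parallel instead of one at a time — can be sketched outside Spark with a plain Java parallel stream. This is a minimal local-filesystem sketch under stated assumptions: the class name, method name, and the use of `java.nio.file.Files` are illustrative stand-ins; the real patch distributes the checks with `sparkContext.parallelize` against the Hadoop `FileSystem`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

public class ParallelExistenceCheck {

  // Probing each path sequentially pays the per-file round-trip latency n times;
  // mapping over the candidates in parallel amortizes that latency, which is the
  // same idea the patch applies with a Spark job over the candidate file list.
  static Optional<Path> firstMissing(List<Path> paths) {
    return paths.parallelStream()
        .filter(p -> !Files.exists(p))
        .findAny(); // any missing file is enough to trigger the full-table-scan fallback
  }

  public static void main(String[] args) {
    // "no-such-hypothetical-file" is a placeholder path assumed not to exist.
    List<Path> toCheck = List.of(Path.of("."), Path.of("no-such-hypothetical-file"));
    System.out.println(firstMissing(toCheck).isPresent());
  }
}
```

Note that on a local filesystem the parallelism buys little; the win in the patch comes from object stores and HDFS, where each existence check is a network round trip.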
 



Re: [PR] [MINOR] Parallelized the check for existence of files in IncrementalRelation. [hudi]

2024-01-11 Thread via GitHub


danny0405 merged PR #10480:
URL: https://github.com/apache/hudi/pull/10480





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888205167

   
   ## CI report:
   
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21936)
 
   * 3934af17773124b22860f1d82fe7bb69945e4a9e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21937)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1888205602

   
   ## CI report:
   
   * c262717fa9b3158690de5f6030c84ae6262b9c74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21935)
 
   * 1b5d4ba50a611488bdc533914c88475ced19fd99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21938)
 
   
   





Re: [I] [SUPPORT] Spark writes data in eight hours [hudi]

2024-01-11 Thread via GitHub


bajiaolong closed issue #10440: [SUPPORT] Spark writes data in eight hours
URL: https://github.com/apache/hudi/issues/10440





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888132678

   
   ## CI report:
   
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21936)
 
   * 3934af17773124b22860f1d82fe7bb69945e4a9e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21937)
 
   
   





Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1888092990

   
   ## CI report:
   
   * c262717fa9b3158690de5f6030c84ae6262b9c74 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21935)
 
   * 1b5d4ba50a611488bdc533914c88475ced19fd99 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21938)
 
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888092435

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21936)
 
   * 3934af17773124b22860f1d82fe7bb69945e4a9e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21937)
 
   
   





Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1888085542

   
   ## CI report:
   
   * c262717fa9b3158690de5f6030c84ae6262b9c74 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21935)
 
   * 1b5d4ba50a611488bdc533914c88475ced19fd99 UNKNOWN
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888084909

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21936)
 
   * 3934af17773124b22860f1d82fe7bb69945e4a9e UNKNOWN
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888025754

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21936)
 
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1888015211

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   * fcbecfc34daac4a9ec66b71d228862eb213119b1 UNKNOWN
   
   





Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1888004714

   
   ## CI report:
   
   * c262717fa9b3158690de5f6030c84ae6262b9c74 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21935)
 
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


yihua commented on code in PR #10241:
URL: https://github.com/apache/hudi/pull/10241#discussion_r1449417384


##
hudi-io/src/test/java/org/apache/hudi/io/hfile/TestHFileReader.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.io.hfile;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.IOException;
+import java.util.Optional;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+
+public class TestHFileReader {

Review Comment:
   More tests are added.






Re: [PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10492:
URL: https://github.com/apache/hudi/pull/10492#issuecomment-1887943396

   
   ## CI report:
   
   * c262717fa9b3158690de5f6030c84ae6262b9c74 UNKNOWN
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1887932610

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21934)
 
   
   





[jira] [Updated] (HUDI-7296) Reduce combinations for some tests to make ci faster

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7296:
-
Labels: pull-request-available  (was: )

> Reduce combinations for some tests to make ci faster
> 
>
> Key: HUDI-7296
> URL: https://issues.apache.org/jira/browse/HUDI-7296
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many 
> combinations of params. While it is good to test everything, there are lots 
> of code paths that have extensive duplicate testing. Reduce the number of 
> tests while still maintaining code coverage





[PR] [HUDI-7296] Reduce CI Time by Minimizing Duplicate Code Coverage in Tests [hudi]

2024-01-11 Thread via GitHub


jonvex opened a new pull request, #10492:
URL: https://github.com/apache/hudi/pull/10492

   ### Change Logs
   
   testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many 
combinations of params. While it is good to test everything, there are lots of 
code paths that have extensive duplicate testing. Reduce the number of tests 
while still maintaining code coverage
   
   ### Impact
   
   faster CI
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7296) Reduce combinations for some tests to make ci faster

2024-01-11 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7296:
-

 Summary: Reduce combinations for some tests to make ci faster
 Key: HUDI-7296
 URL: https://issues.apache.org/jira/browse/HUDI-7296
 Project: Apache Hudi
  Issue Type: Test
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


testBootstrapRead and TestHoodieDeltaStreamerSchemaEvolutionQuick have many 
combinations of params. While it is good to test everything, there are lots of 
code paths that have extensive duplicate testing. Reduce the number of tests 
while still maintaining code coverage





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1887922159

   
   ## CI report:
   
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21933)
 
   * 5330999cce0b2b985de3077eb379528910700cd1 UNKNOWN
   
   





Re: [PR] [HUDI-7170] Implement HFile reader independent of HBase [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10241:
URL: https://github.com/apache/hudi/pull/10241#issuecomment-1887864995

   
   ## CI report:
   
   * 78c32e76253bb0db70f289bf88f9560545b5819e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21921)
 
   * 064eb14f04ec9a4c58d6952c12908ad4165945bf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7293]Incremental read of insert table using rebalance strategy [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10490:
URL: https://github.com/apache/hudi/pull/10490#issuecomment-1887749893

   
   ## CI report:
   
   * 2a2b8ac8d8a32e4e285080ff535d08fdf8a7e687 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21931)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





(hudi) branch master updated (593ea85da20 -> b861ceff179)

2024-01-11 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 593ea85da20 [HUDI-7288] Fix ArrayIndexOutOfBoundsException when 
upgrade nonPartitionedTable created by 0.10/0.11 HUDI version (#10482)
 add b861ceff179 [MINOR] Turning on publishing of test results to Azure 
Devops (#10477)

No new revisions were added by this update.

Summary of changes:
 azure-pipelines-20230430.yml | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)



Re: [PR] [MINOR] Turning on publishing of test results to Azure Devops [hudi]

2024-01-11 Thread via GitHub


xushiyan merged PR #10477:
URL: https://github.com/apache/hudi/pull/10477





Re: [PR] [HUDI-7295]solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1887660952

   
   ## CI report:
   
   * ff2507430e08bc31cc0efaddda85281baf0a6ef5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21930)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-4678][RFC-61] RFC for Snapshot view management [hudi]

2024-01-11 Thread via GitHub


nsivabalan commented on code in PR #6576:
URL: https://github.com/apache/hudi/pull/6576#discussion_r1449186236


##
rfc/rfc-61/rfc-61.md:
##
@@ -0,0 +1,244 @@
+
+# RFC-61: Snapshot view management
+
+
+## Proposers
+
+- @
+
+## Approvers
+ - @
+ - @
+
+## Status
+
+JIRA: [HUDI-4677](https://issues.apache.org/jira/browse/HUDI-4677)
+
+
+## Abstract
+
+For the snapshot view scenario, Hudi already provides two key features:
+* Time travel: the user provides a timestamp to query a specific snapshot view of a Hudi table.
+* Savepoint/restore: a "savepoint" preserves the table as of a commit time so that it can be restored to that point later if need be.
+In practice, users mainly use savepoints to keep the cleaner from removing the snapshot view at a specific timestamp, and delete the savepoint once it has expired.
+The problem is that there is no lifecycle management for savepoints, which is inconvenient for users.
+
+Users also prefer a meaningful name over querying a Hudi table with a raw timestamp; using the timestamp in SQL may lead to the wrong snapshot view being used.
+For example, we can announce that a new tag of a Hudi table named table_nameMMDD was released, and users can then query by this new table name.
+Savepoint was not originally designed for this "snapshot view" scenario; it was designed for disaster recovery.
+Say a new snapshot view is created every day with 7-day retention: we should support lifecycle management on top of it.
+This RFC proposes to let Hudi support releasing a snapshot view, with lifecycle management, out of the box.
+
+## Background
+Introduce any background context which is relevant or necessary to 
+understand the feature and design choices.
+Typical scenarios and benefits of the snapshot view:
+
+1. Basic idea:
+
+![basic_design](./basic_design.png)
+
+Create the snapshot view based on Hudi savepoints:
+* Create snapshot views periodically by time (date time/processing time).
+* Use an external metastore (such as HMS) to store the external view.
+
+Periodic snapshots are built on the schedule the user requires and are stored
+as tables in the metadata management system, so users can easily access the
+data with SQL in Flink, Spark, or Presto.
+Because the stored data is complete and details are not merged away,
+it supports both full-data computation and incremental processing.
+
+2. Compare to Hive solution
+![resource_usage](resrouce_usage.png)
+
+The snapshot view is created from a Hudi savepoint, which significantly reduces the storage footprint of some large tables:
+   * The space usage becomes (1 + (t-1) * p)/t.
+   * Incremental use reduces the amount of data involved in each computation.
+
+When the proportion of changing data is small, snapshot view storage achieves substantial savings.
+A simple formula captures the effect: p is the proportion of changed data, and t is the number of time periods to retain.
+The lower the percentage of changing data, the greater the storage savings, and the savings also hold for long retention periods.
+
+At the same time, it also saves resources for incremental computing.
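As a quick numerical check of the savings formula above (illustrative only, not part of the RFC; `p` is the proportion of changed data and `t` the number of retained periods):

```java
public class SnapshotStorageRatio {
    // Relative space usage of savepoint-based snapshots versus t full copies:
    // one full copy plus (t - 1) incremental deltas of relative size p each.
    static double ratio(double p, int t) {
        return (1 + (t - 1) * p) / t;
    }

    public static void main(String[] args) {
        // 10% daily change retained for 7 days needs only about 23% of the
        // space that 7 full copies would.
        System.out.println(ratio(0.10, 7)); // ~0.229
    }
}
```

With p = 1 (everything changes each period) the ratio degrades to 1, i.e. no savings, matching the intuition above.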
+
+3. Some typical scenarios
+   1. Every day, generate a new snapshot of the original Hudi table named tbl-MMDD. Users can use the snapshot table to generate derived tables
+   and report data; if their downstream calculation logic changes, they can pick the relevant snapshot to re-process.
+   Users can also set retention to X days so out-of-date data is cleaned automatically. SCD-2 should also be achievable here.
+   2. An archived branch named -archived can be generated after compression and optimization. If the retention policy changes
+   (say, removing some sensitive information), a new snapshot can be generated from this branch after the operation is done.
+   3. A snapshot named pre-prod can be released to customers after quality validations pass in any external tool.
+   
+## Implementation
+
+![basic_arch](basic_arch.png)
+
+### Extend Savepoint meta 
+The snapshot view needs to extend the savepoint metadata, so we are going to add 
+one struct with four fields: 
+* tag_name: tag name for your snapshot
+* retain-days: number of days; data belonging to this snapshot is retained for 
+retain-days and can be cleaned after the snapshot expires
+* database: database name in the catalog
+* table-name: table name in the catalog 
+
+The new savepoint metadata should look like below:
+``` diff
+{
+   "type": "record",
+   "name": "HoodieSavepointMetadata",
+   "namespace": "org.apache.hudi.avro.model",
+   "fields": [{
+   "name": "savepointedBy",
+   "type": {
+   "type": "string",
+   "avro.java.string": "String"
+   }, {
+   "na

[jira] [Assigned] (HUDI-7270) Support schema evolution by Flink SQL using HoodieCatalog

2024-01-11 Thread Jing Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhang reassigned HUDI-7270:


Assignee: Jing Zhang

> Support schema evolution by Flink SQL using HoodieCatalog
> -
>
> Key: HUDI-7270
> URL: https://issues.apache.org/jira/browse/HUDI-7270
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>
> Since Flink 1.17, Flink SQL support more advanced alter table syntax.
> {code:sql}
> -- add a new column 
> ALTER TABLE MyTable ADD category_id STRING COMMENT 'identifier of the 
> category';
> -- modify a column type, comment and position
> ALTER TABLE MyTable MODIFY measurement double COMMENT 'unit is bytes per 
> second' AFTER `id`;
> -- drop columns
> ALTER TABLE MyTable DROP (col1, col2, col3);
> -- rename column
> ALTER TABLE MyTable RENAME request_body TO payload;
> {code}
> Find more detail information in [Flink Alter Table SQL 
> |https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/table/sql/alter/].
> We could support schema evolution by Flink SQL.





Re: [PR] [HUDI-7293]Incremental read of insert table using rebalance strategy [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10490:
URL: https://github.com/apache/hudi/pull/10490#issuecomment-1887462462

   
   ## CI report:
   
   * 2a2b8ac8d8a32e4e285080ff535d08fdf8a7e687 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21931)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7293]Incremental read of insert table using rebalance strategy [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10490:
URL: https://github.com/apache/hudi/pull/10490#issuecomment-1887446741

   
   ## CI report:
   
   * 2a2b8ac8d8a32e4e285080ff535d08fdf8a7e687 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


bhat-vinay commented on code in PR #10491:
URL: https://github.com/apache/hudi/pull/10491#discussion_r1449042479


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala:
##
@@ -558,4 +558,50 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  withTempDir { tmp =>
+Seq("cow").foreach { tableType =>
+  val tableName = generateTableName
+  val identifier = tableName
+  spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  id int,
+   |  name string,
+   |  ts long,
+   |  price int
+   |) using hudi
+   |partitioned by (price)
+   |tblproperties (
+   |  type = '$tableType',
+   |  primaryKey = 'id',
+   |  preCombineField = 'ts',
+   |  hoodie.datasource.write.recordkey.field = 'id',
+   |  hoodie.metadata.record.index.enable = 'true',
+   |  hoodie.metadata.index.column.stats.enable = 'true',
+   |  hoodie.metadata.index.column.stats.column.list = 'price'
+   |)
+   |location '${tmp.getCanonicalPath}/$tableName'
+   |""".stripMargin
+  )
+
+  spark.sql(
+s"""
+   | insert into $tableName
+   | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 
3000, 30)
+   | """.stripMargin
+  )
+
+  val result1DF = spark.sql(
+s"select * from hudi_metadata('$identifier')"
+  )
+  result1DF.show(false)

Review Comment:
   @codope could not figure out a way to format/decorate the output RDD 
generated by `MergeOnReadSnapshotRelation`. I was hoping a trait exists 
that a relation could implement (one that allows modifying the RDDs generated 
by `buildScan()`), but did not find anything obvious in the documentation. 
Let's discuss whether this output is okay.







Re: [PR] [HUDI-7295]solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1887431358

   
   ## CI report:
   
   * ff2507430e08bc31cc0efaddda85281baf0a6ef5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21930)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


bhat-vinay commented on code in PR #10491:
URL: https://github.com/apache/hudi/pull/10491#discussion_r1449039323


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestHoodieTableValuedFunction.scala:
##
@@ -558,4 +558,50 @@ class TestHoodieTableValuedFunction extends 
HoodieSparkSqlTestBase {
   }
 }
   }
+
+  test(s"Test hudi_metadata Table-Valued Function") {
+if (HoodieSparkUtils.gteqSpark3_2) {
+  withTempDir { tmp =>
+Seq("cow").foreach { tableType =>
+  val tableName = generateTableName
+  val identifier = tableName
+  spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
+  spark.sql(
+s"""
+   |create table $tableName (
+   |  id int,
+   |  name string,
+   |  ts long,
+   |  price int
+   |) using hudi
+   |partitioned by (price)
+   |tblproperties (
+   |  type = '$tableType',
+   |  primaryKey = 'id',
+   |  preCombineField = 'ts',
+   |  hoodie.datasource.write.recordkey.field = 'id',
+   |  hoodie.metadata.record.index.enable = 'true',
+   |  hoodie.metadata.index.column.stats.enable = 'true',
+   |  hoodie.metadata.index.column.stats.column.list = 'price'
+   |)
+   |location '${tmp.getCanonicalPath}/$tableName'
+   |""".stripMargin
+  )
+
+  spark.sql(
+s"""
+   | insert into $tableName
+   | values (1, 'a1', 1000, 10), (2, 'a2', 2000, 20), (3, 'a3', 
3000, 30)
+   | """.stripMargin
+  )
+
+  val result1DF = spark.sql(
+s"select * from hudi_metadata('$identifier')"
+  )
+  result1DF.show(false)

Review Comment:
   The output here looks like the following: a six-column table (`key`, `type`, 
`filesystemMetadata`, `BloomFilterMetadata`, `ColumnStatsMetadata`, 
`recordIndexMetadata`) with one row per metadata record. For the three 
column-stats rows, only `ColumnStatsMetadata` is populated, carrying the base 
file name plus the min/max struct for `price` (for instance 
`{2f2a4323-62df-4cf1-b1d5-91ed26f082f5-0_1-34-81_2024052056298.parquet, price, 
{null, {20}, null, ...}, {null, {20}, null, ...}, 1, 0, 51, 33, false}`), 
while the other metadata columns are null. The raw table is too wide to render 
here and the rest of the comment is truncated.

[jira] [Updated] (HUDI-7295) split is arranged in ascending order of instant

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7295:
-
Labels: pull-request-available  (was: )

> split is arranged in ascending order of instant
> ---
>
> Key: HUDI-7295
> URL: https://issues.apache.org/jira/browse/HUDI-7295
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
>
>  Splits are forwarded downstream for reading in ascending instant commit 
> time order, but currently they are out of order





Re: [PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


bhat-vinay commented on code in PR #10491:
URL: https://github.com/apache/hudi/pull/10491#discussion_r1449035307


##
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSpark32PlusAnalysis.scala:
##
@@ -134,6 +134,21 @@ case class HoodieSpark32PlusResolveReferences(spark: 
SparkSession) extends Rule[
   catalogTable.location.toString))
 LogicalRelation(relation, catalogTable)
   }
+case HoodieMetadataTableValuedFunction(args) =>
+  val (tablePath, opts) = 
HoodieMetadataTableValuedFunction.parseOptions(args, 
HoodieMetadataTableValuedFunction.FUNC_NAME)
+  val hoodieDataSource = new DefaultSource
+  if (tablePath.contains(Path.SEPARATOR)) {
+// the first param is table path
+val relation = hoodieDataSource.createRelation(spark.sqlContext, opts 
++ Map("path" -> (tablePath + "/.hoodie/metadata")))
+LogicalRelation(relation)
+  } else {
+// the first param is table identifier
+val tableId = 
spark.sessionState.sqlParser.parseTableIdentifier(tablePath)
+val catalogTable = spark.sessionState.catalog.getTableMetadata(tableId)
+val relation = hoodieDataSource.createRelation(spark.sqlContext, opts 
++ Map("path" ->
+  (catalogTable.location.toString + "/.hoodie/metadata")))

Review Comment:
   Could not find any way to format and change the output RDD generated by the 
relation. Thought about subclassing `MergeOnReadSnapshotRelation` with some 
trait that allows decorating/modifying the generated RDD, but could not find 
any such trait. 






[jira] [Updated] (HUDI-7294) Add TVF to query hudi metadata

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7294:
-
Labels: pull-request-available  (was: )

> Add TVF to query hudi metadata
> --
>
> Key: HUDI-7294
> URL: https://issues.apache.org/jira/browse/HUDI-7294
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: pull-request-available
>
> Having a table valued function to query hudi metadata for a given table 
> through spark-sql will help in debugging





[PR] [HUDI-7294] [WIP] TVF to query hudi metadata [hudi]

2024-01-11 Thread via GitHub


bhat-vinay opened a new pull request, #10491:
URL: https://github.com/apache/hudi/pull/10491

   Adds a table-valued function (TVF) to query Hudi metadata through 
spark-sql. Since the metadata table is already a MOR table, it simply creates 
a 'snapshot' relation over it. Could not find any way to format (or filter) 
the RDD generated by the MOR snapshot relation; uploading the PR to get some 
feedback.
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7293) Incremental read of insert table using rebalance strategy

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7293:
-
Labels: pull-request-available  (was: )

> Incremental read of insert table using rebalance strategy
> -
>
> Key: HUDI-7293
> URL: https://issues.apache.org/jira/browse/HUDI-7293
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: 陈磊
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-01-11-22-47-03-463.png, 
> image-2024-01-11-22-50-09-512.png
>
>
> For insert-type tables, we do not need to use keyBy to distribute input 
> splits; using rebalance avoids data-skew issues in the split reader operator
> !image-2024-01-11-22-50-09-512.png|width=606,height=197!
>  





[PR] [HUDI-7293]Incremental read of insert table using rebalance strategy [hudi]

2024-01-11 Thread via GitHub


empcl opened a new pull request, #10490:
URL: https://github.com/apache/hudi/pull/10490

   ### Change Logs
   
   _For insert-type tables, we do not need to use keyBy to distribute input 
splits; using rebalance avoids data-skew issues in the split reader operator._
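   A minimal, self-contained sketch of the skew this avoids (hypothetical key values and plain arrays, not Hudi/Flink code): hashing all splits by one key, as keyBy-style routing does for an insert-only table, piles them onto a single reader, while round-robin distribution (what Flink's `rebalance()` does) spreads them evenly.

```java
import java.util.Arrays;

public class SplitRoutingSkew {
    public static void main(String[] args) {
        int parallelism = 4;
        int numSplits = 100;
        int[] byKey = new int[parallelism];
        int[] roundRobin = new int[parallelism];
        for (int i = 0; i < numSplits; i++) {
            // Insert-only tables can yield a degenerate routing key, so a
            // keyBy-style hash sends every split to the same reader subtask.
            int key = 0;
            byKey[Math.floorMod(Integer.hashCode(key), parallelism)]++;
            // rebalance()-style round-robin spreads splits evenly.
            roundRobin[i % parallelism]++;
        }
        System.out.println("keyBy:     " + Arrays.toString(byKey));      // [100, 0, 0, 0]
        System.out.println("rebalance: " + Arrays.toString(roundRobin)); // [25, 25, 25, 25]
    }
}
```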
   
   ### Impact
   
   _none._
   
   ### Risk level (write none, low medium or high below)
   
   _none._
   
   ### Documentation Update
   
   _none_





[jira] [Created] (HUDI-7295) split is arranged in ascending order of instant

2024-01-11 Thread Jira
陈磊 created HUDI-7295:


 Summary: split is arranged in ascending order of instant
 Key: HUDI-7295
 URL: https://issues.apache.org/jira/browse/HUDI-7295
 Project: Apache Hudi
  Issue Type: Bug
Reporter: 陈磊


 Splits are forwarded downstream for reading in ascending instant commit time 
order, but currently they are out of order
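The intended ordering can be sketched with a hedged, stdlib-only example (`InputSplit` and `instantTime` here are illustrative stand-ins, not the actual Hudi/Flink classes): collect the pending splits and sort them by instant commit time before emitting them downstream.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedSplits {
    // Illustrative stand-in for an input split carrying its commit time.
    record InputSplit(String path, String instantTime) {}

    // Emit splits in ascending instant-commit-time order.
    static List<InputSplit> orderByInstant(List<InputSplit> splits) {
        List<InputSplit> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparing(InputSplit::instantTime));
        return sorted;
    }

    public static void main(String[] args) {
        List<InputSplit> pending = List.of(
            new InputSplit("f2", "20240111103000"),
            new InputSplit("f1", "20240111100000"),
            new InputSplit("f3", "20240111110000"));
        orderByInstant(pending).forEach(s -> System.out.println(s.instantTime()));
        // prints 20240111100000, 20240111103000, 20240111110000
    }
}
```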





[jira] [Created] (HUDI-7294) Add TVF to query hudi metadata

2024-01-11 Thread Vinaykumar Bhat (Jira)
Vinaykumar Bhat created HUDI-7294:
-

 Summary: Add TVF to query hudi metadata
 Key: HUDI-7294
 URL: https://issues.apache.org/jira/browse/HUDI-7294
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Vinaykumar Bhat
Assignee: Vinaykumar Bhat


Having a table valued function to query hudi metadata for a given table through 
spark-sql will help in debugging





Re: [PR] solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10489:
URL: https://github.com/apache/hudi/pull/10489#issuecomment-1887354583

   
   ## CI report:
   
   * ff2507430e08bc31cc0efaddda85281baf0a6ef5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7293) Incremental read of insert table using rebalance strategy

2024-01-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

陈磊 updated HUDI-7293:
-
Description: 
For insert type tables, we do not need to use keyby to distribute inputsplit to 
avoid data skewing issues with the split reader operator

!image-2024-01-11-22-50-09-512.png|width=606,height=197!

 

  was:
For insert type tables, we do not need to use keyby to distribute inputsplit to 
avoid data skewing issues with the split reader operator

!image-2024-01-11-22-50-09-512.png!

 


> Incremental read of insert table using rebalance strategy
> -
>
> Key: HUDI-7293
> URL: https://issues.apache.org/jira/browse/HUDI-7293
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: 陈磊
>Priority: Major
> Attachments: image-2024-01-11-22-47-03-463.png, 
> image-2024-01-11-22-50-09-512.png
>
>
> For insert type tables, we do not need to use keyby to distribute inputsplit 
> to avoid data skewing issues with the split reader operator
> !image-2024-01-11-22-50-09-512.png|width=606,height=197!
>  





[jira] [Created] (HUDI-7293) Incremental read of insert table using rebalance strategy

2024-01-11 Thread Jira
陈磊 created HUDI-7293:


 Summary: Incremental read of insert table using rebalance strategy
 Key: HUDI-7293
 URL: https://issues.apache.org/jira/browse/HUDI-7293
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: 陈磊
 Attachments: image-2024-01-11-22-47-03-463.png, 
image-2024-01-11-22-50-09-512.png

For insert-type tables, we do not need to use keyBy to distribute input 
splits; using rebalance avoids data-skew issues in the split reader operator

!image-2024-01-11-22-50-09-512.png!

 





[PR] solving the problem of disordered output split in incremental read sc… [hudi]

2024-01-11 Thread via GitHub


empcl opened a new pull request, #10489:
URL: https://github.com/apache/hudi/pull/10489

   …enarios
   
   ### Change Logs
   
   _Solving the problem of disordered output splits in incremental read 
scenarios._
   
   ### Impact
   
   _none._
   
   ### Risk level (write none, low medium or high below)
   
   _none._
   
   ### Documentation Update
   
   _none_
   





Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1887321758

   
   ## CI report:
   
   * 29df255ea97787211cbaf2900ce7cfabf794157a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21925)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887322570

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887214366

   
   ## CI report:
   
   * 47f54ad61bb63148eb32041921ea6eaa725566bf Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21927)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1887213648

   
   ## CI report:
   
   * 29df255ea97787211cbaf2900ce7cfabf794157a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21925)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-11 Thread via GitHub


KnightChess commented on PR #9640:
URL: https://github.com/apache/hudi/pull/9640#issuecomment-1887202939

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


KnightChess commented on PR #10191:
URL: https://github.com/apache/hudi/pull/10191#issuecomment-1887202137

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7529] fix multiple tasks get the lock at the same time when use… [hudi]

2024-01-11 Thread via GitHub


KnightChess commented on PR #10412:
URL: https://github.com/apache/hudi/pull/10412#issuecomment-1887201534

   @danny0405 hi, are there any other modification suggestions for this PR? Or does this PR not need to land?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-11 Thread via GitHub


KnightChess closed pull request #9640: [MINOR] change hive/adb tool not auto 
create database default
URL: https://github.com/apache/hudi/pull/9640


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] change hive/adb tool not auto create database default [hudi]

2024-01-11 Thread via GitHub


KnightChess opened a new pull request, #9640:
URL: https://github.com/apache/hudi/pull/9640

   ### Change Logs
   
   Usually, `create database` is an operation performed at a higher level, e.g. by a project-management platform, and requires elevated permissions. It is usually not created by the ETL task itself, so auto-creating it does not meet common business needs.
   
   ### Impact
   
   will not auto create database default
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6207] spark support bucket index query for table with bucket index [hudi]

2024-01-11 Thread via GitHub


KnightChess closed pull request #10191: [HUDI-6207] spark support bucket index 
query for table with bucket index
URL: https://github.com/apache/hudi/pull/10191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1887171861

   
   ## CI report:
   
   * 55a5918fb3706f76a41b9fba793c777566e09363 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21929)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Hard deletion using deltastreamer [hudi]

2024-01-11 Thread via GitHub


ad1happy2go commented on issue #10483:
URL: https://github.com/apache/hudi/issues/10483#issuecomment-1887160093

   @Kangho-Lee So you want to update the old data as well. The only way is to re-ingest that old data so it follows the upsert path again with `_hoodie_is_deleted` set to true. Thanks.
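   As a rough sketch of that re-ingest path (assumptions: `uuid` is the record key and `ts` the precombine field; the sample keys are made up), each record to hard-delete is re-emitted with `_hoodie_is_deleted = true` and a newer precombine value, then written back through the normal upsert operation:

   ```python
   # Sketch only: build a hard-delete payload for records already in the table.
   # Assumption: 'uuid' is the record key, 'ts' the precombine field.
   old_records = [
       {"uuid": "334e26e9-8355-45cc-97c6-c31daf0df330", "ts": 1695159649087},
       {"uuid": "e96c4396-3fad-413a-a942-4cb36106d721", "ts": 1695091554788},
   ]

   # Re-emit each record with the delete marker set and a higher precombine
   # value, so the upsert path prefers the delete over the stored version.
   delete_payload = [
       {**r, "ts": r["ts"] + 1, "_hoodie_is_deleted": True} for r in old_records
   ]

   print(len(delete_payload))
   ```

   Writing `delete_payload` back as a DataFrame with the usual upsert options (e.g. `spark.createDataFrame(...).write.format("hudi")...`) would then remove those keys from the table.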


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7292) NullPointerException while reading using reconcile schema and schema on read

2024-01-11 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-7292:
---

 Summary: NullPointerException while reading using reconcile schema 
and schema on read
 Key: HUDI-7292
 URL: https://issues.apache.org/jira/browse/HUDI-7292
 Project: Apache Hudi
  Issue Type: Bug
  Components: reader-core
Reporter: Aditya Goenka
 Fix For: 1.1.0


Github Issue - 
https://github.com/apache/hudi/issues/10488#issuecomment-1887107501



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] Not Able to df.count When hoodie.schema.on.read.enable=true [hudi]

2024-01-11 Thread via GitHub


ad1happy2go commented on issue #10488:
URL: https://github.com/apache/hudi/issues/10488#issuecomment-1887114014

   Created JIRA for tracking this - 
https://issues.apache.org/jira/browse/HUDI-7292


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] Not Able to df.count When hoodie.schema.on.read.enable=true [hudi]

2024-01-11 Thread via GitHub


ad1happy2go commented on issue #10488:
URL: https://github.com/apache/hudi/issues/10488#issuecomment-1887107501

   Thanks @Amar1404 for raising this. You are right, reads are failing when `hoodie.schema.on.read.enable` is true. Below is a simple reproduction. We hit the exception when both `hoodie.datasource.write.reconcile.schema` and `hoodie.schema.on.read.enable` are true; when either one is false, it works well.
   
   Reproducible code - 
   
   ```
   columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
   data = [
       (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
       (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
       (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
       (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
       (169511511, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai")]
   inserts = spark.createDataFrame(data).toDF(*columns)
   hudi_options = {
       "hoodie.datasource.write.reconcile.schema": "true",
       "hoodie.schema.on.read.enable": "true",
       "hoodie.datasource.write.partitionpath.field": "city",
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.table.name": "hudi_table"
   }

   inserts.write.format("hudi"). \
       options(**hudi_options). \
       mode("overwrite"). \
       save(PATH)

   print(spark.read.format("hudi").options(**hudi_options).load(PATH).count())
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Inconsistent Checkpoint Size in Flink Applications with MoR [hudi]

2024-01-11 Thread via GitHub


FranMorilloAWS commented on issue #10329:
URL: https://github.com/apache/hudi/issues/10329#issuecomment-1887082148

   Are there any improvements or features coming for the bucket index to allow updating multiple partitions or changing the number of buckets?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1887067482

   
   ## CI report:
   
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
 
   * 55a5918fb3706f76a41b9fba793c777566e09363 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21929)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] Not Able to df.count When hoodie.schema.on.read.enable=true [hudi]

2024-01-11 Thread via GitHub


Amar1404 opened a new issue, #10488:
URL: https://github.com/apache/hudi/issues/10488

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   I am not able to do a simple count when hoodie.schema.on.read.enable=true
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   
   
   val hudiOptions = Map(
     "hoodie.parquet.compression.codec" -> "zstd",
     "hoodie.datasource.write.hive_style_partitioning" -> "true",
     "hoodie.embed.timeline.server" -> "true",
     "hoodie.datasource.write.reconcile.schema" -> "false",
     "hoodie.schema.on.read.enable" -> "true",
     "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
     "hoodie.metadata.enable" -> "true",
     "hoodie.index.type" -> "BLOOM"
   )

   val columns = Seq("ts", "uuid", "rider", "driver", "fare", "city")
   val data = Seq(
     (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
     (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
     (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
     (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
     (169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"))

   var inserts = spark.createDataFrame(data).toDF(columns: _*)

   inserts.write.format("org.apache.hudi")
     .option(OPERATION_OPT_KEY, "insert")
     .option(PARTITIONPATH_FIELD_OPT_KEY, "city")
     .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
     .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
     .option(TABLE_NAME, "test_hudi")
     .options(hudiOptions)
     .mode(Overwrite)
     .save(Path)

   spark.read.format("hudi").options(hudiOptions).load(Path).count
   
   **Expected behavior**
   
   The count should be returned instead of the query failing.
   
   **Environment Description**
   
   * Hudi version : 0.12.3,0.14.0
   
   * Spark version : 3.3
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   The Spark SQL phase planning failed with an internal error. Please, fill a bug report in, and provide the full stack trace.
 at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:542)
 at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:554)
 at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:213)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
 at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:212)
 at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:153)
 at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:146)
 at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
 at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
 at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:213)
 at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:552)
 at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:213)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
 at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:212)
 at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
 at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:159)
 at org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$5(QueryExecution.scala:298)
 at org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:657)
 at org.apache.spark.sql.execution.QueryExecution.writePlans(QueryExecution.scala:298)
 at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:313)
 at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:267)
 at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:246)
 at org.apache.spark.sql.execution.SQLExecuti

Re: [I] [SUPPORT] Inconsistent Checkpoint Size in Flink Applications with MoR [hudi]

2024-01-11 Thread via GitHub


FranMorilloAWS commented on issue #10329:
URL: https://github.com/apache/hudi/issues/10329#issuecomment-1887013947

   Then how would you recommend working with Hudi tables that can grow without bound?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886985630

   
   ## CI report:
   
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
 
   * 55a5918fb3706f76a41b9fba793c777566e09363 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (HUDI-6949) Spark support non-blocking concurrency control

2024-01-11 Thread Jing Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhang resolved HUDI-6949.
--

> Spark support non-blocking concurrency control
> --
>
> Key: HUDI-6949
> URL: https://issues.apache.org/jira/browse/HUDI-6949
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark, spark-sql
>Reporter: Jing Zhang
>Assignee: Jing Zhang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

