[jira] [Assigned] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7824:
-

Assignee: sivabalan narayanan

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the cleanup of a 
> commit and the cleaner has moved ahead with respect to the earliest commit to 
> retain, then when the savepoint is removed later, the cleaner should still 
> account for cleaning up that commit. 
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11375:
URL: https://github.com/apache/hudi/pull/11375#issuecomment-2143278909

   
   ## CI report:
   
   * 97933909750b810570745044912e9506bcb0acf2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24181)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11375:
URL: https://github.com/apache/hudi/pull/11375#issuecomment-2143275364

   
   ## CI report:
   
   * 97933909750b810570745044912e9506bcb0acf2 UNKNOWN
   
   



Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2143273073

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 6abd40f1b77feb86cdc95d58cd2285c546a1f63e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24180)
 
   
   



Re: [PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-05-31 Thread via GitHub


nsivabalan commented on code in PR #11375:
URL: https://github.com/apache/hudi/pull/11375#discussion_r1623138103


##
hudi-client/hudi-client-common/src/test/resources/mockito-extensions/org.mockito.plugins.MockMaker:
##
@@ -0,0 +1 @@
+mock-maker-inline

Review Comment:
   Looks like we need this for static mocking to work; I could not get it to 
work otherwise. 
   
https://stackoverflow.com/questions/21105403/mocking-static-methods-with-mockito
 






[jira] [Updated] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7824:
-
Labels: pull-request-available  (was: )

> Fix incremental partitions fetch logic when savepoint is removed for Incr 
> cleaner
> -
>
> Key: HUDI-7824
> URL: https://issues.apache.org/jira/browse/HUDI-7824
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> With the incremental cleaner, if a savepoint is blocking the cleanup of a 
> commit and the cleaner has moved ahead with respect to the earliest commit to 
> retain, then when the savepoint is removed later, the cleaner should still 
> account for cleaning up that commit. 
>  
> Let's ensure the clean planner accounts for all partitions when such a 
> savepoint removal is detected.





[PR] [HUDI-7824] Fixing incr cleaner with savepoint removal [hudi]

2024-05-31 Thread via GitHub


nsivabalan opened a new pull request, #11375:
URL: https://github.com/apache/hudi/pull/11375

   ### Change Logs
   
   Whenever a savepoint is removed, the cleaner should fall back to cleaning 
the entire partition list instead of doing incremental cleaning. An earlier fix 
was attempted in https://github.com/apache/hudi/pull/10651, but it had a bug 
where not all partitions were accounted for. Since the savepoint meta files are 
deleted on removal, and a savepoint tracks the latest base file of every 
partition, it makes sense to clean the entire partition list.
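
The fallback described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `IncrementalCleanPartitionSketch`, its method names, and the idea that the previous clean recorded the set of savepoints it saw are all hypothetical simplifications introduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical sketch of choosing between incremental and full partition
 * listing for the cleaner. None of these names come from the Hudi code base.
 */
public class IncrementalCleanPartitionSketch {

  /** Savepoints seen by the previous clean that are no longer present. */
  public static Set<String> removedSavepoints(Set<String> savepointsAtLastClean,
                                              Set<String> currentSavepoints) {
    Set<String> removed = new TreeSet<>(savepointsAtLastClean);
    removed.removeAll(currentSavepoints);
    return removed;
  }

  /**
   * If any savepoint was removed since the previous clean, return every
   * partition; otherwise return only the partitions changed since then.
   */
  public static List<String> partitionsToClean(Set<String> savepointsAtLastClean,
                                               Set<String> currentSavepoints,
                                               List<String> allPartitions,
                                               List<String> partitionsChangedSinceLastClean) {
    if (!removedSavepoints(savepointsAtLastClean, currentSavepoints).isEmpty()) {
      // A removed savepoint may have been blocking file slices in any
      // partition, so the incremental shortcut is not safe here.
      return new ArrayList<>(allPartitions);
    }
    return new ArrayList<>(partitionsChangedSinceLastClean);
  }

  public static void main(String[] args) {
    List<String> all = Arrays.asList("p1", "p2", "p3");
    List<String> changed = Arrays.asList("p3");
    Set<String> before = new HashSet<>(Arrays.asList("20240501"));

    // Savepoint still present: incremental listing is used.
    System.out.println(partitionsToClean(before, before, all, changed));
    // Savepoint removed: fall back to all partitions.
    System.out.println(partitionsToClean(before, new HashSet<>(), all, changed));
  }
}
```

The sketch deliberately returns all partitions on any removal rather than trying to work out which partitions the savepoint touched, which mirrors the reasoning above that a savepoint covers every partition.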
   
   ### Impact
   
   Whenever a savepoint is removed, the cleaner falls back to cleaning the 
entire partition list instead of doing incremental cleaning. 
   
   ### Risk level (write none, low medium or high below)
   
   low.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7824) Fix incremental partitions fetch logic when savepoint is removed for Incr cleaner

2024-05-31 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-7824:
-

 Summary: Fix incremental partitions fetch logic when savepoint is 
removed for Incr cleaner
 Key: HUDI-7824
 URL: https://issues.apache.org/jira/browse/HUDI-7824
 Project: Apache Hudi
  Issue Type: Bug
  Components: cleaning
Reporter: sivabalan narayanan


With the incremental cleaner, if a savepoint is blocking the cleanup of a 
commit and the cleaner has moved ahead with respect to the earliest commit to 
retain, then when the savepoint is removed later, the cleaner should still 
account for cleaning up that commit. 

Let's ensure the clean planner accounts for all partitions when such a 
savepoint removal is detected.





Re: [PR] [HUDI-5956] Simple repair spark sql dag ui display problem [hudi]

2024-05-31 Thread via GitHub


KnightChess commented on code in PR #8233:
URL: https://github.com/apache/hudi/pull/8233#discussion_r1623133643


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -123,6 +126,24 @@ object HoodieSparkSqlWriter {
 streamingWritesParamsOpt: Option[StreamingWriteParams] = 
Option.empty,
 hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty):
   (Boolean, HOption[String], HOption[String], HOption[String], 
SparkRDDWriteClient[_], HoodieTableConfig) = {
+//TODO reuse DataWritingCommand sparkPlan, reduce the number of sql list 
in SPARK UI SQL tag, rendering raw DAG

Review Comment:
   @codope Sorry for the late reply. There is no overhead; it just doesn't look 
good in the SQL tab. The TODO is hard to fix right now because we reproduce the 
logical plan in the Hudi command plan. I will open a new PR to track the fix.






Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2143241323

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 6abd40f1b77feb86cdc95d58cd2285c546a1f63e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24180)
 
   
   



Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2143219341

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   * 6abd40f1b77feb86cdc95d58cd2285c546a1f63e UNKNOWN
   
   



Re: [PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11374:
URL: https://github.com/apache/hudi/pull/11374#issuecomment-2143216104

   
   ## CI report:
   
   * 05fab0df29530420f0a77abf46be996b70c1bc25 UNKNOWN
   
   



[jira] [Updated] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7823:
-
Labels: pull-request-available  (was: )

> Simplify dependency management on exclusions
> 
>
> Key: HUDI-7823
> URL: https://issues.apache.org/jira/browse/HUDI-7823
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
>






[PR] [HUDI-7823] Simplify dependency management on exclusions [hudi]

2024-05-31 Thread via GitHub


yihua opened a new pull request, #11374:
URL: https://github.com/apache/hudi/pull/11374

   ### Change Logs
   
   This PR simplifies the dependency management on exclusions by moving the 
common dependency exclusions to the root POM.
   
   ### Impact
   
   Simplifies dependency management on exclusions.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2143187023

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * ec6fa62945094d548dce7d7e8e6ef2363ba0d05f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24179)
 
   
   



[jira] [Created] (HUDI-7823) Simplify dependency management on exclusions

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7823:
---

 Summary: Simplify dependency management on exclusions
 Key: HUDI-7823
 URL: https://issues.apache.org/jira/browse/HUDI-7823
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo








Re: [PR] [HUDI-7819] Fix OptionsResolver#allowCommitOnEmptyBatch default value bug [hudi]

2024-05-31 Thread via GitHub


danny0405 commented on PR #11370:
URL: https://github.com/apache/hudi/pull/11370#issuecomment-2143140776

   And some UT failures: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=24160&view=logs&j=7601efb9-4019-552e-11ba-eb31b66593b2&t=9688f101-287d-53f4-2a80-87202516f5d0&l=17578





Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-31 Thread via GitHub


danny0405 commented on code in PR #11043:
URL: https://github.com/apache/hudi/pull/11043#discussion_r1623048400


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BloomFiltersIndexSupport.scala:
##
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.HoodieConversionUtils.toScalaOption
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.FileSlice
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.metadata.HoodieTableMetadataUtil
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.Expression
+
+class BloomFiltersIndexSupport(spark: SparkSession,
+   metadataConfig: HoodieMetadataConfig,
+   metaClient: HoodieTableMetaClient) extends 
RecordLevelIndexSupport(spark, metadataConfig, metaClient) {

Review Comment:
   This is just for code reuse, right? The RLI has nothing to do with querying 
the bloom_filter index.






Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-31 Thread via GitHub


danny0405 commented on code in PR #11043:
URL: https://github.com/apache/hudi/pull/11043#discussion_r1623048159


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestBloomFiltersIndexSupport.scala:
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.common.config.{HoodieMetadataConfig, TypedProperties}
+import org.apache.hudi.common.model.{FileSlice, HoodieTableType}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
+import org.apache.hudi.common.testutils.RawTripTestPayload.recordsToStrings
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.metadata.HoodieMetadataFileSystemView
+import org.apache.hudi.testutils.HoodieSparkClientTestBase
+import org.apache.hudi.util.{JFunction, JavaConversions}
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, 
HoodieFileIndex}
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
Expression, Literal}
+import org.apache.spark.sql.functions.{col, not}
+import org.apache.spark.sql.types.StringType
+import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.assertTrue
+import org.junit.jupiter.api.{AfterEach, BeforeEach, Test}
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.EnumSource
+
+import java.util.concurrent.atomic.AtomicInteger
+import java.util.stream.Collectors
+import scala.collection.JavaConverters._
+import scala.collection.{JavaConverters, mutable}
+
+class TestBloomFiltersIndexSupport extends HoodieSparkClientTestBase {
+
+  val sqlTempTable = "hudi_tbl_bloom"
+  var spark: SparkSession = _
+  var instantTime: AtomicInteger = _
+  val metadataOpts: Map[String, String] = Map(
+HoodieMetadataConfig.ENABLE.key -> "true",
+HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key -> "true",
+HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key -> "_row_key"
+  )
+  val commonOpts: Map[String, String] = Map(
+"hoodie.insert.shuffle.parallelism" -> "4",
+"hoodie.upsert.shuffle.parallelism" -> "4",
+HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+RECORDKEY_FIELD.key -> "_row_key",
+PARTITIONPATH_FIELD.key -> "partition",
+PRECOMBINE_FIELD.key -> "timestamp",
+HoodieTableConfig.POPULATE_META_FIELDS.key -> "true"
+  ) ++ metadataOpts
+  var mergedDfList: List[DataFrame] = List.empty
+
+  @BeforeEach
+  override def setUp(): Unit = {
+initPath()
+initSparkContexts()
+initHoodieStorage()
+initTestDataGenerator()
+
+setTableName("hoodie_test")
+initMetaClient()
+
+instantTime = new AtomicInteger(1)
+
+spark = sqlContext.sparkSession
+  }
+
+  @AfterEach
+  override def tearDown(): Unit = {
+cleanupFileSystem()
+cleanupSparkContexts()
+  }
+
+  @ParameterizedTest
+  @EnumSource(classOf[HoodieTableType])
+  def testIndexInitialization(tableType: HoodieTableType): Unit = {
+val hudiOpts = commonOpts + (DataSourceWriteOptions.TABLE_TYPE.key -> 
tableType.name())
+doWriteAndValidateBloomFilters(
+  hudiOpts,
+  operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
+  saveMode = SaveMode.Overwrite)
+  }
+
+  /**
+   * Test case to do a write with updates and then validate file pruning using 
bloom filters.
+   */
+  @Test
+  def testBloomFiltersIndexFilePruning(): Unit = {
+var hudiOpts = commonOpts
+hudiOpts = hudiOpts + (
+  DataSourceReadOptions.ENABLE_DATA_SKIPPING.key -> "true")
+
+doWriteAndValidateBloomFilters(
+  hudiOpts,
+  operation = DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
+  saveMode = SaveMode.Overwrite,
+  shouldValidate = false)
+doWriteAndValidateBloomFilters(
+  hudiOpts,
+  operation = DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
+  saveMode = SaveMode.Append)
+
+createTempTable(hudiOpts)
+verifyQueryPredicate(hudiOpts, "_row_key")
+  }
+
+  private def createTempTable(hudiOpts: Map[String, String]): Unit = {
+val readDf = spark.read.format("hudi

Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2143101212

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d504e37ab6cee7d80e53e6daf2df1ef95eea01b7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24169)
 
   * ec6fa62945094d548dce7d7e8e6ef2363ba0d05f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24179)
 
   
   



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2143089758

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d504e37ab6cee7d80e53e6daf2df1ef95eea01b7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24169)
 
   * ec6fa62945094d548dce7d7e8e6ef2363ba0d05f UNKNOWN
   
   



Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2143082242

   
   ## CI report:
   
   * 0dc960c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24173)
 
   
   



Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2143073580

   
   ## CI report:
   
   * 0dc960c61eb43e9c1f1e97cf60d772145e1b2c3e Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24178)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24173)
 
   * 0dc960c UNKNOWN
   
   



(hudi) branch master updated: [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests (#10931)

2024-05-31 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 16e1adb5b3c [HUDI-7822] Resolve the conflicts between mixed hdfs and 
local path in Flink tests (#10931)
16e1adb5b3c is described below

commit 16e1adb5b3c8e3601044deec8e880ac15ccb74c8
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Sat Jun 1 06:34:51 2024 +0800

[HUDI-7822] Resolve the conflicts between mixed hdfs and local path in 
Flink tests (#10931)

Co-authored-by: Y Ethan Guo 
---
 .../hudi/table/catalog/TestHoodieCatalog.java   | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
index 98c98bebcce..f6737128698 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java
@@ -28,6 +28,7 @@ import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.configuration.FlinkOptions;
 import org.apache.hudi.configuration.HadoopConfigurations;
+import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieValidationException;
 import org.apache.hudi.keygen.ComplexAvroKeyGenerator;
 import org.apache.hudi.keygen.NonpartitionedAvroKeyGenerator;
@@ -66,12 +67,14 @@ import 
org.apache.flink.table.catalog.exceptions.TableAlreadyExistException;
 import org.apache.flink.table.catalog.exceptions.TableNotExistException;
 import org.apache.flink.table.types.DataType;
 import org.apache.flink.table.types.logical.LogicalTypeRoot;
+import org.apache.hadoop.fs.FileSystem;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.io.TempDir;
 
 import java.io.File;
+import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
@@ -173,8 +176,12 @@ public class TestHoodieCatalog {
 streamTableEnv.getConfig().getConfiguration()
 
.setInteger(ExecutionConfigOptions.TABLE_EXEC_RESOURCE_DEFAULT_PARALLELISM, 2);
 
-File catalogPath = new File(tempFile.getPath());
-catalogPath.mkdir();
+try {
+  FileSystem fs = FileSystem.get(HadoopConfigurations.getHadoopConf(new 
Configuration()));
+  fs.mkdirs(new org.apache.hadoop.fs.Path(tempFile.getPath()));
+} catch (IOException e) {
+  throw new HoodieIOException("Failed to create tempFile dir.", e);
+}
 
 catalog = new HoodieCatalog("hudi", 
Configuration.fromMap(getDefaultCatalogOption()));
 catalog.open();
@@ -266,6 +273,7 @@ public class TestHoodieCatalog {
 
 // validate key generator for partitioned table
 HoodieTableMetaClient metaClient = createMetaClient(
+new HadoopStorageConfiguration(HadoopConfigurations.getHadoopConf(new 
Configuration())),
 catalog.inferTablePath(catalogPathStr, tablePath));
 String keyGeneratorClassName = 
metaClient.getTableConfig().getKeyGeneratorClassName();
 assertEquals(keyGeneratorClassName, 
SimpleAvroKeyGenerator.class.getName());
@@ -283,6 +291,7 @@ public class TestHoodieCatalog {
 
 catalog.createTable(singleKeyMultiplePartitionPath, 
singleKeyMultiplePartitionTable, false);
 metaClient = createMetaClient(
+new HadoopStorageConfiguration(HadoopConfigurations.getHadoopConf(new 
Configuration())),
 catalog.inferTablePath(catalogPathStr, 
singleKeyMultiplePartitionPath));
 keyGeneratorClassName = 
metaClient.getTableConfig().getKeyGeneratorClassName();
 assertThat(keyGeneratorClassName, 
is(ComplexAvroKeyGenerator.class.getName()));
@@ -300,6 +309,7 @@ public class TestHoodieCatalog {
 
 catalog.createTable(multipleKeySinglePartitionPath, 
multipleKeySinglePartitionTable, false);
 metaClient = createMetaClient(
+new HadoopStorageConfiguration(HadoopConfigurations.getHadoopConf(new 
Configuration())),
 catalog.inferTablePath(catalogPathStr, 
singleKeyMultiplePartitionPath));
 keyGeneratorClassName = 
metaClient.getTableConfig().getKeyGeneratorClassName();
 assertThat(keyGeneratorClassName, 
is(ComplexAvroKeyGenerator.class.getName()));
@@ -317,7 +327,9 @@ public class TestHoodieCatalog {
 
 catalog.createTable(nonPartitionPath, nonPartitionCatalogTable, false);
 
-metaClient = createMetaClient(catalog.inferTablePath(catalogPathStr, 
nonPartitionPath));
+metaClient = createMetaClient(
+new HadoopStorageConfiguration(HadoopConfigurations.getHadoopConf(new 
Configuration())),
+catalog.inferTab

Re: [PR] [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #10931:
URL: https://github.com/apache/hudi/pull/10931#issuecomment-2143060956

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/c82f3d79-edd6-4ae8-838e-8d760f153ed6





Re: [PR] [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests [hudi]

2024-05-31 Thread via GitHub


yihua merged PR #10931:
URL: https://github.com/apache/hudi/pull/10931





(hudi) branch master updated: [MINOR] Avoid listing files for empty tables (#11155)

2024-05-31 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new acecf304254 [MINOR] Avoid listing files for empty tables (#11155)
acecf304254 is described below

commit acecf3042549583de31cad176fb500c55bb61700
Author: Tim Brown 
AuthorDate: Fri May 31 17:30:14 2024 -0500

[MINOR] Avoid listing files for empty tables (#11155)
---
 .../hudi/metadata/HoodieBackedTableMetadataWriter.java | 17 -
 .../hudi/table/action/commit/UpsertPartitioner.java| 18 +++---
 2 files changed, 23 insertions(+), 12 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 831c2e1882c..604399b7382 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -83,12 +83,14 @@ import org.slf4j.LoggerFactory;
 
 import java.io.FileNotFoundException;
 import java.io.IOException;
+import java.util.ArrayDeque;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
+import java.util.Queue;
 import java.util.Set;
 import java.util.function.Function;
 import java.util.stream.Collectors;
@@ -761,7 +763,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
* @return List consisting of {@code DirectoryInfo} for each partition found.
*/
   private List listAllPartitionsFromFilesystem(String 
initializationTime, Set pendingDataInstants) {
-List pathsToList = new LinkedList<>();
+if (dataMetaClient.getActiveTimeline().countInstants() == 0) {
+  return Collections.emptyList();
+}
+Queue pathsToList = new ArrayDeque<>();
 pathsToList.add(new StoragePath(dataWriteConfig.getBasePath()));
 
 List partitionsToBootstrap = new LinkedList<>();
@@ -773,16 +778,18 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 while (!pathsToList.isEmpty()) {
   // In each round we will list a section of directories
   int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
+  List pathsToProcess = new ArrayList<>(numDirsToList);
+  for (int i = 0; i < numDirsToList; i++) {
+pathsToProcess.add(pathsToList.poll());
+  }
   // List all directories in parallel
   engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing " + 
numDirsToList + " partitions from filesystem");
-  List processedDirectories = 
engineContext.map(pathsToList.subList(0, numDirsToList), path -> {
+  List processedDirectories = 
engineContext.map(pathsToProcess, path -> {
 HoodieStorage storage = new HoodieHadoopStorage(path, storageConf);
 String relativeDirPath = 
FSUtils.getRelativePartitionPath(storageBasePath, path);
 return new DirectoryInfo(relativeDirPath, 
storage.listDirectEntries(path), initializationTime, pendingDataInstants);
   }, numDirsToList);
 
-  pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, 
pathsToList.size()));
-
   // If the listing reveals a directory, add it to queue. If the listing 
reveals a hoodie partition, add it to
   // the results.
   for (DirectoryInfo dirInfo : processedDirectories) {
@@ -815,10 +822,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
* @return List consisting of {@code DirectoryInfo} for each partition found.
*/
   private List listAllPartitionsFromMDT(String 
initializationTime, Set pendingDataInstants) throws IOException {
-List dirinfoList = new LinkedList<>();
 List allPartitionPaths = metadata.getAllPartitionPaths().stream()
 .map(partitionPath -> dataWriteConfig.getBasePath() + 
StoragePath.SEPARATOR_CHAR + partitionPath).collect(Collectors.toList());
 Map> partitionFileMap = 
metadata.getAllFilesInPartitions(allPartitionPaths);
+List dirinfoList = new ArrayList<>(partitionFileMap.size());
 for (Map.Entry> entry : 
partitionFileMap.entrySet()) {
   dirinfoList.add(new DirectoryInfo(entry.getKey(), entry.getValue(), 
initializationTime, pendingDataInstants));
 }
diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
index 09904cd290e..ea125614170 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
+++ 
b/hudi-client/h
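
The queue-based batching in the HoodieBackedTableMetadataWriter hunk above can be sketched in isolation. This is a minimal illustration, not the Hudi implementation: the class name, the in-memory `children` map standing in for the filesystem, and the sequential loop standing in for `engineContext.map` are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class BatchedDirectoryWalk {
  // Walks a directory tree in batches of at most `parallelism` paths.
  // `children` is a hypothetical in-memory stand-in for a storage listing call.
  static List<String> walk(Map<String, List<String>> children, String root, int parallelism) {
    Queue<String> pathsToList = new ArrayDeque<>();
    pathsToList.add(root);
    List<String> visited = new ArrayList<>();
    while (!pathsToList.isEmpty()) {
      // Drain one batch from the head of the queue; poll() is O(1) per element,
      // so no sub-list copy is needed to drop the processed prefix.
      int numDirsToList = Math.min(parallelism, pathsToList.size());
      List<String> pathsToProcess = new ArrayList<>(numDirsToList);
      for (int i = 0; i < numDirsToList; i++) {
        pathsToProcess.add(pathsToList.poll());
      }
      // Stand-in for the parallel listing step: process the batch sequentially.
      for (String path : pathsToProcess) {
        visited.add(path);
        pathsToList.addAll(children.getOrDefault(path, Collections.emptyList()));
      }
    }
    return visited;
  }

  public static void main(String[] args) {
    Map<String, List<String>> tree = new HashMap<>();
    tree.put("/", Arrays.asList("a", "b"));
    tree.put("a", Arrays.asList("a/1"));
    System.out.println(walk(tree, "/", 2));
  }
}
```

Draining the batch with `poll()` avoids rebuilding the remaining list on every round, which is what the removed `new LinkedList<>(pathsToList.subList(...))` line did.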

Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


yihua merged PR #11155:
URL: https://github.com/apache/hudi/pull/11155





Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2143057842

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/189a84c2-9029-43c8-a4f2-e0d93a0d34bc





Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2143057499

   Azure CI is green.
   https://github.com/apache/hudi/assets/2497195/b2e99712-2aa6-47d9-81a6-6fca43217863





Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2143035154

   
   ## CI report:
   
   * 3062782a8b6b02da35c82a87c4ffa1f061f22dc3 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24177)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24174)
 
   
   



Re: [PR] [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10931:
URL: https://github.com/apache/hudi/pull/10931#issuecomment-2143034731

   
   ## CI report:
   
   * 3d87799728e8015152212910997bf7e21ca3a40d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24176)
 
   
   



Re: [PR] [HUDI-7821] Handle case where older proto message is read with new schema [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11373:
URL: https://github.com/apache/hudi/pull/11373#issuecomment-2143027374

   
   ## CI report:
   
   * 32abc805a2e2d4764215bd7dea93ce72c0532bec Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24171)
 
   
   



Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11368:
URL: https://github.com/apache/hudi/pull/11368#issuecomment-2143027333

   
   ## CI report:
   
   * d945405ab2605efcb2dd86a8fff6f9dc622ae14a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24172)
 
   
   



Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2143027274

   
   ## CI report:
   
   * 0dc960c61eb43e9c1f1e97cf60d772145e1b2c3e Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24178)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24173)
 
   
   



Re: [PR] [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10931:
URL: https://github.com/apache/hudi/pull/10931#issuecomment-2142986046

   
   ## CI report:
   
   * e09914c58cede10a0b8efb315837e6e9d34b1d95 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23048)
 
   * 3d87799728e8015152212910997bf7e21ca3a40d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24176)
 
   
   



Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2142986396

   
   ## CI report:
   
   * c62bc211274fbe2b31dd8d07d7ede8ecae5f6d64 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24093)
 
   * 3062782a8b6b02da35c82a87c4ffa1f061f22dc3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24177)
 
   
   



Re: [PR] [MINOR] Avoid listing files for empty tables [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11155:
URL: https://github.com/apache/hudi/pull/11155#issuecomment-2142978738

   
   ## CI report:
   
   * c62bc211274fbe2b31dd8d07d7ede8ecae5f6d64 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24093)
 
   * 3062782a8b6b02da35c82a87c4ffa1f061f22dc3 UNKNOWN
   
   



Re: [PR] [HUDI-7814] Exclude unused transitive dependencies that introduce vulnerabilities [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11364:
URL: https://github.com/apache/hudi/pull/11364#issuecomment-2142979136

   
   ## CI report:
   
   * ff1e3d8a934fe1a2c92e341be610516476bf5d7a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24153)
 
   * 0dc960c61eb43e9c1f1e97cf60d772145e1b2c3e UNKNOWN
   
   



Re: [PR] [HUDI-7822] Resolve the conflicts between mixed hdfs and local path in Flink tests [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10931:
URL: https://github.com/apache/hudi/pull/10931#issuecomment-2142978283

   
   ## CI report:
   
   * e09914c58cede10a0b8efb315837e6e9d34b1d95 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23048)
 
   * 3d87799728e8015152212910997bf7e21ca3a40d UNKNOWN
   
   



[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7822:
-
Labels: pull-request-available  (was: )

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
>






Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11368:
URL: https://github.com/apache/hudi/pull/11368#issuecomment-2142970542

   
   ## CI report:
   
   * 1dde761d4147e9c1a94914759ca0bfd0f7d23ec7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24154)
 
   * d945405ab2605efcb2dd86a8fff6f9dc622ae14a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24172)
 
   
   



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2142969807

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * d504e37ab6cee7d80e53e6daf2df1ef95eea01b7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24169)
 
   
   



[jira] [Commented] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851205#comment-17851205
 ] 

Ethan Guo commented on HUDI-7822:
-

https://github.com/apache/hudi/pull/10931

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Created] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-7822:
---

Summary: Resolve the conflicts between mixed hdfs and local path in Flink tests
Key: HUDI-7822
URL: https://issues.apache.org/jira/browse/HUDI-7822
Project: Apache Hudi
Issue Type: Bug
Reporter: Ethan Guo








[jira] [Updated] (HUDI-7822) Resolve the conflicts between mixed hdfs and local path in Flink tests

2024-05-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7822:

Fix Version/s: 1.0.0

> Resolve the conflicts between mixed hdfs and local path in Flink tests
> --
>
> Key: HUDI-7822
> URL: https://issues.apache.org/jira/browse/HUDI-7822
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Priority: Major
> Fix For: 1.0.0
>
>






Re: [PR] [HUDI-7821] Handle case where older proto message is read with new schema [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11373:
URL: https://github.com/apache/hudi/pull/11373#issuecomment-2142920593

   
   ## CI report:
   
   * 32abc805a2e2d4764215bd7dea93ce72c0532bec Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24171)
 
   
   



Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11368:
URL: https://github.com/apache/hudi/pull/11368#issuecomment-2142920538

   
   ## CI report:
   
   * 1dde761d4147e9c1a94914759ca0bfd0f7d23ec7 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24154)
 
   * d945405ab2605efcb2dd86a8fff6f9dc622ae14a UNKNOWN
   
   



Re: [PR] [HUDI-7718] Try to fetch the latestSourceProfile in HoodieIncrSource [hudi]

2024-05-31 Thread via GitHub


yihua commented on code in PR #11175:
URL: https://github.com/apache/hudi/pull/11175#discussion_r1622908331


##
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java:
##
@@ -344,7 +385,7 @@ private void 
readAndAssert(IncrSourceHelper.MissingCheckpointStrategy missingChe
 snapshotCheckPointImplClassOpt.map(className ->
 
properties.setProperty(SnapshotLoadQuerySplitter.Config.SNAPSHOT_LOAD_QUERY_SPLITTER_CLASS_NAME,
 className));
 TypedProperties typedProperties = new TypedProperties(properties);
-HoodieIncrSource incrSource = new HoodieIncrSource(typedProperties, jsc(), 
spark(), new DummySchemaProvider(HoodieTestDataGenerator.AVRO_SCHEMA));
+HoodieIncrSource incrSource = new HoodieIncrSource(typedProperties, jsc(), 
spark(), metrics, new DefaultStreamContext(new 
DummySchemaProvider(HoodieTestDataGenerator.AVRO_SCHEMA), sourceProfile));

Review Comment:
   Could you validate that the source parallelism changes after the source 
profile is passed in?






Re: [PR] [HUDI-7718] Try to fetch the latestSourceProfile in HoodieIncrSource [hudi]

2024-05-31 Thread via GitHub


yihua commented on code in PR #11175:
URL: https://github.com/apache/hudi/pull/11175#discussion_r1622904727


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java:
##
@@ -231,7 +243,15 @@ public Pair>, String> 
fetchNextBatch(Option lastCkpt
 // Remove Hoodie meta columns except partition path from input source
 String[] colsToDrop = shouldDropMetaFields ? 
HoodieRecord.HOODIE_META_COLUMNS.stream().toArray(String[]::new) :
 HoodieRecord.HOODIE_META_COLUMNS.stream().filter(x -> 
!x.equals(HoodieRecord.PARTITION_PATH_METADATA_FIELD)).toArray(String[]::new);
-final Dataset src = source.drop(colsToDrop);
+Dataset src = source.drop(colsToDrop);
+if (getLatestSourceProfile().isPresent()) {
+  src = coalesceOrRepartition(src, 
getLatestSourceProfile().get().getSourcePartitions());
+}

Review Comment:
   Could `getLatestSourceProfile().map().orElse()` be used instead of 
reassigning the variable?
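
A minimal sketch of the suggestion above, assuming `java.util.Optional` as a stand-in for Hudi's `Option` and a string label as a stand-in for the Spark `Dataset`. The class name and the `coalesceOrRepartition` body here are hypothetical and only illustrate the shape of the expression.

```java
import java.util.Optional;

public class SourceProfileExample {
  // Hypothetical stand-in for coalesceOrRepartition: tags the dataset label
  // with the requested partition count instead of touching a real Dataset.
  static String coalesceOrRepartition(String dataset, int partitions) {
    return dataset + "@" + partitions;
  }

  // Single assignment: repartition when a profile is present, otherwise keep
  // the input unchanged. No mutable local variable is needed.
  static String applyProfile(String dropped, Optional<Integer> sourcePartitions) {
    return sourcePartitions
        .map(partitions -> coalesceOrRepartition(dropped, partitions))
        .orElse(dropped);
  }

  public static void main(String[] args) {
    System.out.println(applyProfile("src", Optional.of(8)));
    System.out.println(applyProfile("src", Optional.empty()));
  }
}
```

Because the lambda captures the input, the value passed in has to be effectively final, which is why the result of `source.drop(colsToDrop)` would be held in its own variable before the `map` call.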






Re: [PR] [HUDI-7821] Handle case where older proto message is read with new schema [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11373:
URL: https://github.com/apache/hudi/pull/11373#issuecomment-2142911283

   
   ## CI report:
   
   * 32abc805a2e2d4764215bd7dea93ce72c0532bec UNKNOWN
   
   



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2142900530

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 475a1bc220eaee04fa78ba46a922b434b8306047 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24150)
 
   * d504e37ab6cee7d80e53e6daf2df1ef95eea01b7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24169)
 
   
   



[jira] [Updated] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7821:
-
Labels: pull-request-available  (was: )

> Handle schema evolution in proto to avro conversion
> ---
>
> Key: HUDI-7821
> URL: https://issues.apache.org/jira/browse/HUDI-7821
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Timothy Brown
> Priority: Major
> Labels: pull-request-available
>
> Users can encounter errors when a batch of data was written with an older 
> schema and a new schema has fields that are not present in the old data





[PR] [HUDI-7821] Handle case where older proto message is read with new schema [hudi]

2024-05-31 Thread via GitHub


the-other-tim-brown opened a new pull request, #11373:
URL: https://github.com/apache/hudi/pull/11373

   ### Change Logs
   
   - Adds support for handling proto messages that are missing fields. 
Previously, such messages caused null pointer exceptions.
   
   ### Impact
   
   Allows users consuming protos to evolve their schemas
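
As a rough illustration of the failure mode and the general idea of the fix, the sketch below fills fields that the newer schema declares but the older message lacks. It is not the code in this PR: the class name, the use of plain maps in place of proto messages and Avro records, and the per-field default map are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MissingFieldConversion {
  // Builds an output record with every field the newer schema declares.
  // `schemaDefaults` is a hypothetical map of field name -> default value.
  // Fields absent from the older message take the default (possibly null)
  // instead of being dereferenced, which is where an NPE would otherwise arise.
  static Map<String, Object> convert(Map<String, Object> message,
                                     Map<String, Object> schemaDefaults) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : schemaDefaults.entrySet()) {
      String name = field.getKey();
      Object value = message.containsKey(name) ? message.get(name) : field.getValue();
      out.put(name, value);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> defaults = new LinkedHashMap<>();
    defaults.put("id", 0L);
    defaults.put("name", "");
    defaults.put("email", null); // field added in the newer schema

    Map<String, Object> oldMessage = new LinkedHashMap<>();
    oldMessage.put("id", 7L);
    oldMessage.put("name", "a");

    System.out.println(convert(oldMessage, defaults));
  }
}
```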
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change. If not, put "none"._
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7669] Move config classes and utils to proper places [hudi]

2024-05-31 Thread via GitHub


yihua closed pull request #11095: [HUDI-7669] Move config classes and utils to 
proper places
URL: https://github.com/apache/hudi/pull/11095





[jira] [Created] (HUDI-7821) Handle schema evolution in proto to avro conversion

2024-05-31 Thread Timothy Brown (Jira)
Timothy Brown created HUDI-7821:
---

Summary: Handle schema evolution in proto to avro conversion
Key: HUDI-7821
URL: https://issues.apache.org/jira/browse/HUDI-7821
Project: Apache Hudi
Issue Type: Bug
Reporter: Timothy Brown


Users can encounter errors when a batch of data was written with an older 
schema and a new schema has fields that are not present in the old data





Re: [PR] [HUDI-7669] Move config classes and utils to proper places [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11095:
URL: https://github.com/apache/hudi/pull/11095#issuecomment-2142879176

   Closing this PR as it is no longer required.





Re: [PR] [MINOR] Fix operation total io should not exceed the target io limit [hudi]

2024-05-31 Thread via GitHub


yihua commented on code in PR #11174:
URL: https://github.com/apache/hudi/pull/11174#discussion_r1622874529


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/BoundedIOCompactionStrategy.java:
##
@@ -44,10 +44,10 @@ public List 
orderAndFilter(HoodieWriteConfig writeCon
 for (HoodieCompactionOperation op : operations) {
   long opIo = op.getMetrics().get(TOTAL_IO_MB).longValue();
   targetIORemaining -= opIo;
-  finalOperations.add(op);
-  if (targetIORemaining <= 0) {
+  if (targetIORemaining < 0) {
 return finalOperations;
   }
+  finalOperations.add(op);

Review Comment:
   This can lead to starvation if the target IO limit is always smaller than 
the `TOTAL_IO_MB` of the first compaction operation.
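
One way to address the starvation concern, sketched under assumptions: always admit the first operation, then stop at the first operation that no longer fits in the remaining budget. `BoundedIoSketch` and `selectWithinBudget` are hypothetical names, and plain `long` values stand in for each operation's `TOTAL_IO_MB`; this is not the code in the PR.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundedIoSketch {
  /**
   * Selects operations in order while their cumulative IO fits in targetIoMb.
   * The first operation is always admitted, even if it alone exceeds the limit,
   * so a single oversized operation cannot block compaction indefinitely.
   */
  static List<Long> selectWithinBudget(List<Long> opIoMb, long targetIoMb) {
    List<Long> selected = new ArrayList<>();
    long remaining = targetIoMb;
    for (long io : opIoMb) {
      // Stop at the first operation that does not fit, unless nothing is selected yet.
      if (!selected.isEmpty() && io > remaining) {
        break;
      }
      selected.add(io);
      remaining -= io;
    }
    return selected;
  }
}
```

With this shape the total can exceed the limit only when the very first operation is larger than the limit on its own; whether that trade-off is acceptable is a design decision for the strategy.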






Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11368:
URL: https://github.com/apache/hudi/pull/11368#issuecomment-2142855292

   Could you also raise a PR against 
https://github.com/apache/hudi/tree/branch-0.x?





Re: [PR] [HUDI-7748] Update ErrorTableAwareChainedTransformer.java [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11197:
URL: https://github.com/apache/hudi/pull/11197#issuecomment-2142852669

   Are the changes in this PR still needed?





(hudi) branch master updated (9536d40f75d -> 7f8da18e550)

2024-05-31 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 9536d40f75d [MINOR] Avoid logging full commit metadata at info level 
(#11372)
 add 7f8da18e550 [HUDI-7766] Adding staging jar deployment command for 
Spark 3.5 and Scala 2.13 profile (#11234)

No new revisions were added by this update.

Summary of changes:
 scripts/release/deploy_staging_jars.sh | 9 +
 1 file changed, 9 insertions(+)



Re: [PR] [HUDI-7766] Adding staging jar deployment command for Spark 3.5 and Scala 2.13 profile [hudi]

2024-05-31 Thread via GitHub


yihua merged PR #11234:
URL: https://github.com/apache/hudi/pull/11234





Re: [PR] [HUDI-7766] Adding staging jar deployment command for Spark 3.5 and Scala 2.13 profile [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11234:
URL: https://github.com/apache/hudi/pull/11234#issuecomment-2142851461

   Skipping CI as only the release script is updated.





Re: [PR] [WIP][ENM] Fix pending compaction check 3 [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11217:
URL: https://github.com/apache/hudi/pull/11217#issuecomment-2142849288

   Closing this draft.  Feel free to reopen when ready for review.





Re: [PR] [WIP][ENM] Fix pending compaction check 3 [hudi]

2024-05-31 Thread via GitHub


yihua closed pull request #11217: [WIP][ENM] Fix pending compaction check 3
URL: https://github.com/apache/hudi/pull/11217





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2142847155

   
   ## CI report:
   
   * c98242b22fb2518c0cc93c037df558037030500f UNKNOWN
   * 475a1bc220eaee04fa78ba46a922b434b8306047 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24150)
 
   * d504e37ab6cee7d80e53e6daf2df1ef95eea01b7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7816]: Provide SourceProfileSupplier option into the SnapshotLoadQuerySplitter [hudi]

2024-05-31 Thread via GitHub


yihua commented on code in PR #11368:
URL: https://github.com/apache/hudi/pull/11368#discussion_r1622849730


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java:
##
@@ -61,20 +62,21 @@ public SnapshotLoadQuerySplitter(TypedProperties 
properties) {
*
* @param df The dataset to process.
* @param beginCheckpointStr The starting checkpoint string.
+   * @param sourceProfileSupplier An Option of a SourceProfileSupplier to use 
in load splitting implementation
* @return The next checkpoint as an Option.
*/
-  public abstract Option getNextCheckpoint(Dataset df, String 
beginCheckpointStr);
+  public abstract Option getNextCheckpoint(Dataset df, String 
beginCheckpointStr, Option sourceProfileSupplier);
 
   /**
-   * Retrieves the next checkpoint based on query information.
+   * Retrieves the next checkpoint based on query information and a 
SourceProfileSupplier.
*
* @param df The dataset to process.
* @param queryInfo The query information object.
* @return Updated query information with the next checkpoint, in case of 
empty checkpoint,
* returning endPoint same as queryInfo.getEndInstant().
*/
-  public QueryInfo getNextCheckpoint(Dataset df, QueryInfo queryInfo) {
-return getNextCheckpoint(df, queryInfo.getStartInstant())
+  public QueryInfo getNextCheckpoint(Dataset df, QueryInfo queryInfo, 
Option sourceProfileSupplier) {

Review Comment:
   Add the new parameter `@param sourceProfileSupplier ` to the 
javadocs, same as above.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SnapshotLoadQuerySplitter.java:
##
@@ -61,20 +62,21 @@ public SnapshotLoadQuerySplitter(TypedProperties 
properties) {
*
* @param df The dataset to process.
* @param beginCheckpointStr The starting checkpoint string.
+   * @param sourceProfileSupplier An Option of a SourceProfileSupplier to use 
in load splitting implementation

Review Comment:
   Let's mark this class with `@PublicAPIClass(maturity = 
ApiMaturityLevel.EVOLVING)` and the abstract methods with 
`@PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)`, given that this class 
serves as an extendable API class for users to plug in custom implementations.
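
A rough sketch of what such markers look like on an extendable class. The annotation and enum types below are local stand-ins declared only so the example compiles on its own; they mirror the names in the comment above but are not Hudi's actual classes, and `SplitterSketch` with its simplified signature is hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class ApiMaturitySketch {
  // Local stand-in for a maturity enum; the values are assumed for illustration.
  enum ApiMaturityLevel { EVOLVING, STABLE, DEPRECATED }

  // Local stand-in for a class-level marker annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface PublicAPIClass {
    ApiMaturityLevel maturity();
  }

  // Local stand-in for a method-level marker annotation.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.METHOD)
  @interface PublicAPIMethod {
    ApiMaturityLevel maturity();
  }

  // Hypothetical extendable class: both the class and its abstract method are marked EVOLVING.
  @PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
  abstract static class SplitterSketch {
    @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
    abstract String getNextCheckpoint(String beginCheckpointStr);
  }

  // Reads the class-level marker via reflection; returns null when the class is unmarked.
  static ApiMaturityLevel classMaturity(Class<?> type) {
    PublicAPIClass marker = type.getAnnotation(PublicAPIClass.class);
    return marker == null ? null : marker.maturity();
  }
}
```

The point of the markers is to signal to implementers that signatures may still change, which matters here because the PR adds a parameter to an abstract method.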






Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2142823018

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 18fbd92eec10c49025db364be79cc9dbfccee362 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24162)
 
   
   
   





(hudi) branch master updated (130ea1a3142 -> 9536d40f75d)

2024-05-31 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 130ea1a3142 [HUDI-7762]  Optimizing Hudi Table Check with Delta Lake 
by Refining Class Name Checks In Spark3.5 (#11224)
 add 9536d40f75d [MINOR] Avoid logging full commit metadata at info level 
(#11372)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/client/BaseHoodieTableServiceClient.java | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)



Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


yihua merged PR #11372:
URL: https://github.com/apache/hudi/pull/11372





Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


yihua commented on PR #11372:
URL: https://github.com/apache/hudi/pull/11372#issuecomment-2142819366

   Could you also raise a PR against 
https://github.com/apache/hudi/tree/branch-0.x?





Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-05-31 Thread via GitHub


yihua commented on code in PR #10957:
URL: https://github.com/apache/hudi/pull/10957#discussion_r1622828538


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -231,7 +231,13 @@ private static Option findNestedField(Schema 
schema, String[] fiel
 if (!nestedPart.isPresent()) {
   return Option.empty();
 }
-return nestedPart;
+boolean isUnion = false;

Review Comment:
   Could you write a unit test around the new logic (make the method accessible 
by the test class)?






Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11372:
URL: https://github.com/apache/hudi/pull/11372#issuecomment-2142758270

   
   ## CI report:
   
   * d9f2656aac6864e31474cc45506ceeefc8b8b36e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24166)
 
   
   
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2142756169

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 2201cb0dea3acbe7597b319be7f14ce7a2a8543f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24165)
 
   * 18fbd92eec10c49025db364be79cc9dbfccee362 UNKNOWN
   
   
   





Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11372:
URL: https://github.com/apache/hudi/pull/11372#issuecomment-2142672637

   
   ## CI report:
   
   * d9f2656aac6864e31474cc45506ceeefc8b8b36e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24166)
 
   
   
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2142670862

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 2201cb0dea3acbe7597b319be7f14ce7a2a8543f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24165)
 
   
   
   





Re: [PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11372:
URL: https://github.com/apache/hudi/pull/11372#issuecomment-2142662394

   
   ## CI report:
   
   * d9f2656aac6864e31474cc45506ceeefc8b8b36e UNKNOWN
   
   
   





Re: [PR] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #10422:
URL: https://github.com/apache/hudi/pull/10422#issuecomment-2142660333

   
   ## CI report:
   
   * 99517e23baa60a6a0602e9daf7f522f3c1dcfa1e UNKNOWN
   * 15ed1ad17c8b99804d6e404342a11fab6e212935 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22078)
 
   * 2201cb0dea3acbe7597b319be7f14ce7a2a8543f UNKNOWN
   
   
   





Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-31 Thread via GitHub


the-other-tim-brown commented on code in PR #11154:
URL: https://github.com/apache/hudi/pull/11154#discussion_r1622695605


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -151,11 +163,11 @@ public static Schema nullableSchema(Schema schema) {
* @param schema a avro schema.
* @return a hudi type.
*/
-  public static Type buildTypeFromAvroSchema(Schema schema) {
+  public static Type buildTypeFromAvroSchema(Schema schema, Map existingNameToPositions) {
 // set flag to check this has not been visited.
-Deque visited = new LinkedList();
-AtomicInteger nextId = new AtomicInteger(1);
-return visitAvroSchemaToBuildType(schema, visited, true, nextId);
+Deque visited = new LinkedList<>();
+AtomicInteger nextId = new AtomicInteger(0);

Review Comment:
   I thought this was a bug since you typically start with 0 when coding






Re: [PR] [HUDI-7713] Enforce ordering of fields during schema reconciliation [hudi]

2024-05-31 Thread via GitHub


jonvex commented on code in PR #11154:
URL: https://github.com/apache/hudi/pull/11154#discussion_r1622678545


##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -117,10 +120,19 @@ public static Schema convert(Type type, String name) {
 
   /** Convert an avro schema into internal type. */
   public static Type convertToField(Schema schema) {
-return buildTypeFromAvroSchema(schema);
+return buildTypeFromAvroSchema(schema, Collections.emptyMap());
   }
 
+  private static Type convertToField(Schema schema, Map 
existingFieldNameToPositionMapping) {
+return buildTypeFromAvroSchema(schema, existingFieldNameToPositionMapping);
+  }
+
+

Review Comment:
   remove empty line



##
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/AvroInternalSchemaConverter.java:
##
@@ -151,11 +163,11 @@ public static Schema nullableSchema(Schema schema) {
* @param schema a avro schema.
* @return a hudi type.
*/
-  public static Type buildTypeFromAvroSchema(Schema schema) {
+  public static Type buildTypeFromAvroSchema(Schema schema, Map existingNameToPositions) {
 // set flag to check this has not been visited.
-Deque visited = new LinkedList();
-AtomicInteger nextId = new AtomicInteger(1);
-return visitAvroSchemaToBuildType(schema, visited, true, nextId);
+Deque visited = new LinkedList<>();
+AtomicInteger nextId = new AtomicInteger(0);

Review Comment:
   Why do we go from 1 -> 0? Is this because we remove
   ```
   if (firstVisitRoot) {
     nextAssignId = 0;
   }
   ```



##
hudi-spark-datasource/hudi-spark-common/src/test/java/org/apache/hudi/TestHoodieSchemaUtils.java:
##
@@ -239,6 +240,51 @@ void testMissingColumn(boolean allowDroppedColumns) {
 }
   }
 
+  @Test
+  void testFieldReordering() {
+// field order changes and incoming schema is missing an existing field
+Schema start = createRecord("reorderFields",
+createPrimitiveField("field1", Schema.Type.INT),
+createPrimitiveField("field2", Schema.Type.INT),
+createPrimitiveField("field3", Schema.Type.INT));
+Schema end = createRecord("reorderFields",
+createPrimitiveField("field3", Schema.Type.INT),
+createPrimitiveField("field1", Schema.Type.INT));
+assertEquals(start, deduceWriterSchema(end, start, true));
+
+// nested field ordering changes and new field is added
+start = createRecord("reorderNestedFields",
+createPrimitiveField("field1", Schema.Type.INT),
+createPrimitiveField("field2", Schema.Type.INT),
+createArrayField("field3", createRecord("nestedRecord",
+createPrimitiveField("nestedField1", Schema.Type.INT),
+createPrimitiveField("nestedField2", Schema.Type.INT),
+createPrimitiveField("nestedField3", Schema.Type.INT))),
+createPrimitiveField("field4", Schema.Type.INT));
+end = createRecord("reorderNestedFields",
+createPrimitiveField("field1", Schema.Type.INT),
+createPrimitiveField("field2", Schema.Type.INT),
+createPrimitiveField("field5", Schema.Type.INT),
+createArrayField("field3", createRecord("nestedRecord",
+createPrimitiveField("nestedField2", Schema.Type.INT),
+createPrimitiveField("nestedField1", Schema.Type.INT),
+createPrimitiveField("nestedField3", Schema.Type.INT),
+createPrimitiveField("nestedField4", Schema.Type.INT))),
+createPrimitiveField("field4", Schema.Type.INT));
+
+Schema expected = createRecord("reorderNestedFields",
+createPrimitiveField("field1", Schema.Type.INT),
+createPrimitiveField("field2", Schema.Type.INT),
+createArrayField("field3", createRecord("reorderNestedFields.field3",

Review Comment:
   ok, can you please change the nested record name to 
`reorderNestedFields.field3` in start and end? That way we isolate what we are 
testing
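
For readers following the test, the ordering rule it exercises can be sketched roughly as follows. This is an illustration only, working on field names rather than Avro schemas: `FieldOrderSketch` and `reconcileOrder` are hypothetical names, and the choice to append new fields after all existing ones is an assumption, since the expected schema in the diff above is cut off.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class FieldOrderSketch {
  /**
   * Existing fields keep their original positions, including fields the incoming
   * schema dropped. Fields that appear only in the incoming schema are appended
   * afterwards, in incoming order (assumed placement).
   */
  static List<String> reconcileOrder(List<String> existing, List<String> incoming) {
    // LinkedHashSet preserves first-insertion order and ignores repeated names.
    Set<String> ordered = new LinkedHashSet<>(existing);
    ordered.addAll(incoming);
    return new ArrayList<>(ordered);
  }
}
```

For the first case in the test, `reconcileOrder` applied to `[field1, field2, field3]` and `[field3, field1]` returns the existing order unchanged, which matches the `assertEquals(start, ...)` expectation.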






[PR] [MINOR] Avoid logging full commit metadata at info level [hudi]

2024-05-31 Thread via GitHub


the-other-tim-brown opened a new pull request, #11372:
URL: https://github.com/apache/hudi/pull/11372

   ### Change Logs
   
   - Updates log messages to avoid logging full commit metadata after each 
table service to reduce volume of logs when working with large tables
   
   ### Impact
   
   - Reduce log volume during normal operation
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   





Re: [I] [SUPPORT]Hudi Deltastreamer compaction is taking longer duration [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11273:
URL: https://github.com/apache/hudi/issues/11273#issuecomment-2142552863

   @SuneethaYamani The metadata table helps reduce file-listing API calls. 
You can disable it if it is the only bottleneck.
   
   That said, we would like to understand why it is taking so long. Can you 
share the writer configs?





Re: [I] duplicated records when use insert overwrite [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11358:
URL: https://github.com/apache/hudi/issues/11358#issuecomment-2142473345

   @njalan Also, as I understand it, the data you are writing is the output of 
10 tables. So when you do insert_overwrite, does the source data frame 
contain duplicates?





Re: [PR] [WIP] [HUDI-6787] Implement the HoodieFileGroupReader API for Hive [hudi]

2024-05-31 Thread via GitHub


codope commented on code in PR #10422:
URL: https://github.com/apache/hudi/pull/10422#discussion_r1622387730


##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieFileGroupReaderRecordReader.java:
##
@@ -0,0 +1,294 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.hadoop;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.config.HoodieCommonConfig;
+import org.apache.hudi.common.config.HoodieReaderConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.BaseFile;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.TableSchemaResolver;
+import org.apache.hudi.common.table.read.HoodieFileGroupReader;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TablePathUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader;
+import org.apache.hudi.hadoop.realtime.RealtimeSplit;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.UnaryOperator;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
+import static 
org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
+import static 
org.apache.hudi.common.config.HoodieMemoryConfig.MAX_MEMORY_FOR_MERGE;
+import static 
org.apache.hudi.common.config.HoodieMemoryConfig.SPILLABLE_MAP_BASE_PATH;
+
+public class HoodieFileGroupReaderRecordReader implements 
RecordReader  {
+
+  public interface HiveReaderCreator {
+org.apache.hadoop.mapred.RecordReader 
getRecordReader(
+final org.apache.hadoop.mapred.InputSplit split,
+final org.apache.hadoop.mapred.JobConf job,
+final org.apache.hadoop.mapred.Reporter reporter
+) throws IOException;
+  }
+
+  private final HiveHoodieReaderContext readerContext;
+  private final HoodieFileGroupReader fileGroupReader;
+  private final ArrayWritable arrayWritable;
+  private final NullWritable nullWritable = NullWritable.get();
+  private final InputSplit inputSplit;
+  private final JobConf jobConfCopy;
+  private final UnaryOperator reverseProjection;
+
+  public HoodieFileGroupReaderRecordReader(HiveReaderCreator readerCreator,
+   final InputSplit split,
+   final JobConf jobConf,
+   final Reporter reporter) throws 
IOException {
+this.jobConfCopy = new JobConf(jobConf);
+HoodieRealtimeInputFormatUtils.cleanProjectionColumnIds(jobConfCopy);
+Set partitionColumns = new 
HashSet<>(getPartitionFieldNames(jobConfCopy));
+this.inputSplit = split;
+
+FileSplit fileSplit = (FileSplit) split;
+String tableBasePath = getTableBasePath(split, jo

(hudi) branch master updated (0e55f0900d8 -> 130ea1a3142)

2024-05-31 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 0e55f0900d8 [HUDI-7817] Use Jackson Core instead of org.codehaus.jackson for JSON encoding (#11369)
 add 130ea1a3142 [HUDI-7762]  Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 (#11224)

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala | 1 -
 .../main/scala/org/apache/spark/sql/adapter/Spark3_5Adapter.scala   | 6 +-
 2 files changed, 5 insertions(+), 2 deletions(-)



Re: [PR] [HUDI-7762] Optimizing Hudi Table Check with Delta Lake by Refining Class Name Checks In Spark3.5 [hudi]

2024-05-31 Thread via GitHub


leesf merged PR #11224:
URL: https://github.com/apache/hudi/pull/11224





Re: [I] issue with reading the data using hudi streamer [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11263:
URL: https://github.com/apache/hudi/issues/11263#issuecomment-2141946954

   Using the schema registry fixed this issue. It was discussed in this thread: 
https://apache-hudi.slack.com/archives/C4D716NPQ/p1716384858692059





Re: [I] issue with reading the data using hudi streamer [hudi]

2024-05-31 Thread via GitHub


codope closed issue #11263: issue with reading the data using hudi streamer
URL: https://github.com/apache/hudi/issues/11263





Re: [I] [SUPPORT] - Partial update of the MOR table after compaction with Hudi Streamer [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11348:
URL: https://github.com/apache/hudi/issues/11348#issuecomment-2141809072

   @kirillklimenko I tried to mimic a similar scenario, but it is avoiding columns 
with null values. Can you come up with a reproducible script?





Re: [I] duplicated records when use insert overwrite [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11358:
URL: https://github.com/apache/hudi/issues/11358#issuecomment-2141806002

   @njalan Are you using multiple writers? Can you come up with a reproducible 
script? Note that you are using a very old Hudi version.





Re: [I] [SUPPORT] Reliable ingestion from AWS S3 using Hudi is failing with software.amazon.awssdk.services.sqs.model.EmptyBatchRequestException [hudi]

2024-05-31 Thread via GitHub


ad1happy2go commented on issue #11168:
URL: https://github.com/apache/hudi/issues/11168#issuecomment-2141790434

   @SuneethaYamani Yeah, this was there in 0.14.1 only. Thanks.





Re: [I] [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue [hudi]

2024-05-31 Thread via GitHub


codope closed issue #11349: [SUPPORT] Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue
URL: https://github.com/apache/hudi/issues/11349





Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2141435482

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * b4a5700f408e7ef6639eb05528a029d7de45e99f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24161)
 
   
   



Re: [I] [SUPPORT] StreamWriteFunction support Exectly-Once in Flink ? [hudi]

2024-05-31 Thread via GitHub


seekforshell closed issue #11004: [SUPPORT] StreamWriteFunction support Exectly-Once in Flink ?
URL: https://github.com/apache/hudi/issues/11004





Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2141364905

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 6ece7645a69b367901c71ab78dea15f39d69fca5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24140)
   * b4a5700f408e7ef6639eb05528a029d7de45e99f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24161)
 
   
   



Re: [PR] [HUDI-7007] Add bloom_filters index support on read side [hudi]

2024-05-31 Thread via GitHub


hudi-bot commented on PR #11043:
URL: https://github.com/apache/hudi/pull/11043#issuecomment-2141355244

   
   ## CI report:
   
   * 541b544049e68b3d22cdf0f5159fbd9b0005d345 UNKNOWN
   * 6ece7645a69b367901c71ab78dea15f39d69fca5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=24140)
 
   * b4a5700f408e7ef6639eb05528a029d7de45e99f UNKNOWN
   
   