Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058358581

   Upon further testing after upgrading to the new master version, we have 
discovered missing data. Per our test expectations, the results for all days 
should be consistent and equal to the data from the first day. However, as the 
attached screenshot shows, the data for subsequent days is inconsistent. I have 
confirmed that the entire data system has been stopped for more than half an 
hour, ruling out the possibility of any pending or unfinished data processing.





Re: [I] [SUPPORT]Data Loss Issue with Hudi Table After 3 Days of Continuous Writes [hudi]

2024-04-15 Thread via GitHub


juice411 commented on issue #11016:
URL: https://github.com/apache/hudi/issues/11016#issuecomment-2058328083

   Could you please share your contact information in China? I noticed you are 
located in Hangzhou, and it would be helpful for further communication.





Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


codope commented on code in PR #11021:
URL: https://github.com/apache/hudi/pull/11021#discussion_r1566778192


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -349,10 +350,15 @@ case class HoodieFileIndex(spark: SparkSession,
       Option.empty
     } else if (recordKeys.nonEmpty) {
       Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
-    } else if (functionalIndex.isIndexAvailable && !queryFilters.isEmpty) {
+    } else if (functionalIndex.isIndexAvailable && queryFilters.nonEmpty && functionalIndex.extractSparkFunctionNames(queryFilters).nonEmpty) {
+      val functionToColumnNames = functionalIndex.extractSparkFunctionNames(queryFilters)
+      // Currently, only one functional index in the query is supported. HUDI-7620 for supporting multiple functions.
+      checkState(functionToColumnNames.size == 1, "Currently, only one function with functional index in the query is supported")
+      val (indexFunction, targetColumnName) = functionToColumnNames.head
+      val partitionOption = functionalIndex.getFunctionalIndexPartition(indexFunction, targetColumnName)
       val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
       val shouldReadInMemory = functionalIndex.shouldReadInMemory(this, queryReferencedColumns)
-      val indexDf = functionalIndex.loadFunctionalIndexDataFrame("", shouldReadInMemory)
+      val indexDf = functionalIndex.loadFunctionalIndexDataFrame(partitionOption.get, shouldReadInMemory)

Review Comment:
   Handled the case. The functional index partition option now gets lazily 
initialized, and the whole if block is entered only when the option is nonEmpty.
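
   A minimal standalone sketch of that lazy-guard pattern, with hypothetical 
names (`lookupIndexPartition`, `candidateFiles`) rather than the actual 
`HoodieFileIndex` members; the point is only that the lookup runs once, lazily, 
and that `.get` is reached only after the branch condition has checked `nonEmpty`:

   ```scala
   // Illustrative only: stands in for HoodieFileIndex's functional-index lookup.
   object LazyPartitionGuard {
     // Simulates resolving the functional index partition for a (function, column) pair.
     def lookupIndexPartition(function: String, column: String): Option[String] =
       if (function == "from_unixtime" && column == "ts") Some("func_index_idx_ts") else None

     def candidateFiles(queryFunctions: Seq[(String, String)]): Option[Seq[String]] = {
       // Lazily initialized: the lookup only runs if partitionOpt is actually touched.
       lazy val partitionOpt: Option[String] =
         queryFunctions.headOption.flatMap { case (f, c) => lookupIndexPartition(f, c) }

       // The branch is entered only when the option is nonEmpty, so .get cannot throw.
       if (queryFunctions.nonEmpty && partitionOpt.nonEmpty) {
         Some(Seq(s"files pruned via ${partitionOpt.get}"))
       } else {
         None // fall through to the other index lookups
       }
     }

     def main(args: Array[String]): Unit = {
       println(candidateFiles(Seq(("from_unixtime", "ts")))) // Some(List(...))
       println(candidateFiles(Seq(("upper", "name"))))       // None
     }
   }
   ```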






Re: [PR] [HUDI-7580] Fix order of fields when records inserted out of order [hudi]

2024-04-15 Thread via GitHub


codope commented on PR #11019:
URL: https://github.com/apache/hudi/pull/11019#issuecomment-2058318220

   > Wouldn't it be better to fail create table if the partition columns are 
not at the end? This seems like it could lead to more confusion since you still 
need to do the insert into with the partition columns at the end.
   
   Spark-sql has no such restriction, so I don't think it's a good idea to fail 
create table if the partition columns are not at the end. Also, note that the 
issue was fixed in Hudi 0.11.0 in 
https://github.com/apache/hudi/commit/ea1fbc71ecf9eec20815e2bc51bb8489e4ca1804, 
and unfortunately regressed in 
https://github.com/apache/hudi/commit/cfd0c1ee34460332053491fd1e68c2607c14e958. 
This PR is merely trying to restore the behavior of Hudi 0.11.0 while still 
keeping the perf improvements made by the latter commit intact.
   
   > Also, how would this affect the other sql commands?
   
   This change is only in `InsertIntoHoodieTableCommand`, and should not affect 
any sql command other than `INSERT INTO` and `CREATE TABLE AS SELECT`.





Re: [PR] [HUDI-7580] Fix order of fields when records inserted out of order [hudi]

2024-04-15 Thread via GitHub


codope commented on code in PR #11019:
URL: https://github.com/apache/hudi/pull/11019#discussion_r1566760872


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala:
##
@@ -142,12 +142,26 @@ object InsertIntoHoodieTableCommand extends Logging with ProvidesHoodieConfig wi
     //   positionally for example
     val expectedQueryColumns = catalogTable.tableSchemaWithoutMetaFields.filterNot(f => staticPartitionValues.contains(f.name))
     val coercedQueryOutput = coerceQueryOutputColumns(StructType(expectedQueryColumns), cleanedQuery, catalogTable, conf)
+    val coercedQueryOutputWithoutMetaFields = removeMetaFields(coercedQueryOutput.output)
+    val dataProjectsWithoutMetaFields = getTableFieldsAlias(coercedQueryOutputWithoutMetaFields, StructType(expectedQueryColumns).fields)
     // After potential reshaping validate that the output of the query conforms to the table's schema
     validate(removeMetaFields(coercedQueryOutput.schema), partitionsSpec, catalogTable)
+    val staticPartitionValuesExprs = createStaticPartitionValuesExpressions(staticPartitionValues, targetPartitionSchema)
+    Project(dataProjectsWithoutMetaFields ++ staticPartitionValuesExprs, coercedQueryOutput)
+  }
 
-    val staticPartitionValuesExprs = createStaticPartitionValuesExpressions(staticPartitionValues, targetPartitionSchema, conf)
-
-    Project(coercedQueryOutput.output ++ staticPartitionValuesExprs, coercedQueryOutput)
+  private def getTableFieldsAlias(queryOutputWithoutMetaFields: Seq[Attribute],
+                                  schemaWithoutMetaFields: Seq[StructField]): Seq[Alias] = {
+    queryOutputWithoutMetaFields.zip(schemaWithoutMetaFields).map { case (dataAttr, dataField) =>
+      val targetAttrOption = if (dataAttr.name.startsWith("col")) {

Review Comment:
   This is to ignore any unresolved attributes (ideally, we should never hit 
this situation). This check was present in versions before 0.12.0 - 
https://github.com/apache/hudi/commit/ea1fbc71ecf9eec20815e2bc51bb8489e4ca1804#diff-1b7609d92cd5c6f871c01031ab923e18470ab5077271639905099241320314e7






Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2058301553

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   * 24d5e7f082788b257a42598df5f1d2378e32b041 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23264)
 
   * 5115971a4b483de1534a8dfdb943e40ca95aceef Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23280)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2058292874

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   * 24d5e7f082788b257a42598df5f1d2378e32b041 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23264)
 
   * 5115971a4b483de1534a8dfdb943e40ca95aceef UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058284585

   
   ## CI report:
   
   * 67ca721df255223c873303aeccf7900c29f7811a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [SUPPORT] The Hive run_sync_tool's Logged Command & The Actual Command Do Not Match [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on issue #11029:
URL: https://github.com/apache/hudi/issues/11029#issuecomment-2058252483

   Sure, thanks for the nice findings, and any contributions are welcome :)





Re: [I] [SUPPORT] How we can speed up individual file write(HoodieMergeHandle part) [hudi]

2024-04-15 Thread via GitHub


xushiyan commented on issue #10997:
URL: https://github.com/apache/hudi/issues/10997#issuecomment-2058238697

   > we have clustering to group rows together, but it's still thousands of 
files affected. 75th percentile of individual file overwrite(task in the Doing 
partition and writing data stage) takes ~40-60 seconds
   
   Based on this, I think clustering can be tuned further to rewrite files such 
that more updates are targeted to the same file, reducing write amplification. 
Make sure your number of clustering groups is not limited to the default of 30; 
otherwise you miss a lot of files to cluster. COW is expected to have high 
write amplification with heavy updates, especially if the updates are spread 
across a lot of files. Also consider a better partitioning scheme so that 
updates are concentrated in a few partitions, if possible. Upgrading to a newer 
version also lets you try configuring the executor type.
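
   As a concrete sketch of that tuning advice, here is what the relevant knobs 
look like on a Spark datasource write (config keys per the Hudi docs; the table 
name, base path, and sort column below are made up for illustration):

   ```scala
   // Hedged sketch: raise the clustering group cap and sort by the update key so
   // that each clustering plan rewrites more of the small files.
   import org.apache.spark.sql.SparkSession

   object ClusteringTuningSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("clustering-tuning").getOrCreate()
       val updates = spark.read.format("hudi").load("/tmp/hudi/events_updates")

       updates.write.format("hudi")
         .option("hoodie.table.name", "events")
         .option("hoodie.datasource.write.operation", "upsert")
         .option("hoodie.clustering.inline", "true")
         .option("hoodie.clustering.inline.max.commits", "4")
         // default is 30; a low cap means many file groups are never clustered
         .option("hoodie.clustering.plan.strategy.max.num.groups", "300")
         // co-locate rows that are updated together to cut write amplification
         .option("hoodie.clustering.plan.strategy.sort.columns", "customer_id")
         .mode("append")
         .save("/tmp/hudi/events")
     }
   }
   ```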





(hudi) branch master updated (40d4f489389 -> c54be848f96)

2024-04-15 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 40d4f489389 [HUDI-7577] Avoid MDT compaction instant time conflicts 
(#10992)
 add c54be848f96 [MINOR] Remove redundant lines in StreamSync and 
TestStreamSyncUnitTests (#11027)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/utilities/streamer/StreamSync.java   |  4 ----
 .../utilities/streamer/TestStreamSyncUnitTests.java  | 20 --------------------
 2 files changed, 24 deletions(-)



Re: [PR] [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests [hudi]

2024-04-15 Thread via GitHub


yihua merged PR #11027:
URL: https://github.com/apache/hudi/pull/11027





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058193126

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   * 67ca721df255223c873303aeccf7900c29f7811a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058187793

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   * 67ca721df255223c873303aeccf7900c29f7811a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2058182194

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * f7ab315084f8534388db563a20d34b174cc63fa3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7515] Fix partition metadata write failure [hudi]

2024-04-15 Thread via GitHub


boneanxs commented on code in PR #10886:
URL: https://github.com/apache/hudi/pull/10886#discussion_r1566657358


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java:
##
@@ -92,11 +92,12 @@ public int getPartitionDepth() {
 
   /**
    * Write the metadata safely into partition atomically.
+   * To avoid concurrent write into the same partition (for example in speculative case),
+   * please make sure writeToken is unique.
    */
-  public void trySave(int taskPartitionId) {
+  public void trySave(String writeToken) throws IOException {

Review Comment:
   Hey, @Tartarus0zm what is parallel write? 2 jobs writing to the same path?
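
   To make the speculative-execution concern concrete, here is a tiny 
standalone sketch (not the real `FSUtils` or Spark `TaskContext` APIs) of why a 
token that includes the attempt number disambiguates two attempts of the same 
task, while a bare task partition id does not:

   ```scala
   // Illustrative only: mirrors the "partitionId-stageId-attemptId" token idea.
   object WriteTokenSketch {
     final case class TaskAttempt(partitionId: Int, stageId: Int, attemptNumber: Int)

     def makeWriteToken(t: TaskAttempt): String =
       s"${t.partitionId}-${t.stageId}-${t.attemptNumber}"

     def main(args: Array[String]): Unit = {
       val original    = TaskAttempt(partitionId = 7, stageId = 2, attemptNumber = 0)
       val speculative = TaskAttempt(partitionId = 7, stageId = 2, attemptNumber = 1)
       // With only the int partitionId (the old trySave signature), both attempts
       // would collide on the same metadata file name; the tokens below differ.
       println(makeWriteToken(original))    // 7-2-0
       println(makeWriteToken(speculative)) // 7-2-1
     }
   }
   ```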






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058148444

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058142670

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
 
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11027:
URL: https://github.com/apache/hudi/pull/11027#issuecomment-2058142651

   
   ## CI report:
   
   * b96388ad837c124fb63a8655f295fadebc37319f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23273)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058137317

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
 
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]

2024-04-15 Thread via GitHub


wombatu-kun commented on PR #10942:
URL: https://github.com/apache/hudi/pull/10942#issuecomment-2058119247

   @nsivabalan Hi! Sorry to bother you, but you are the reporter of this task. 
Could you please review my PR?
   Or close the PR if I totally misunderstood the task and did it wrong.





[jira] [Closed] (HUDI-6762) Remove usages of MetadataRecordsGenerationParams

2024-04-15 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov closed HUDI-6762.
---
 Reviewers: Sagar Sumit
Resolution: Fixed

Fixed via master branch: 7c12decc86ccfbdc8bd06fe64d3e1c507cbbfbf6

> Remove usages of MetadataRecordsGenerationParams
> 
>
> Key: HUDI-6762
> URL: https://issues.apache.org/jira/browse/HUDI-6762
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vova Kolmakov
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> MetadataRecordsGenerationParams is deprecated. We already rely on table 
> config for enabled mdt partition types. See if we can remove this POJO.





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058104791

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2058104279

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * 7f43200dfc27f8ff499c9d0c9c375b635120f67e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23272)
 
   * f7ab315084f8534388db563a20d34b174cc63fa3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23275)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[I] [SUPPORT] The Hive run_sync_tool's Logged Output & The Actual Command Do Not Match [hudi]

2024-04-15 Thread via GitHub


samserpoosh opened a new issue, #11029:
URL: https://github.com/apache/hudi/issues/11029

   This is a pretty small/minor issue I noticed while working with the 
`HiveSyncTool`. Essentially what's being logged does **not** match what's 
actually being executed:
   
   
https://github.com/apache/hudi/blob/40d4f489389083e3c6d69954361d3de4aec8186a/hudi-sync/hudi-hive-sync/run_sync_tool.sh#L54-L55
   
   As you see in the above snippet, the JARs passed to the `java -cp` command 
have one order when they're logged and another when the command is actually 
executed. I made the change below on my end and was able to run the 
`HiveSyncTool` without any issues. But with the original ordering, I was 
getting a `ClassNotFoundException`.
   
   ```diff
   echo "Running Command : java -cp ${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:$HUDI_HIVE_UBER_JAR org.apache.hudi.hive.HiveSyncTool $@"
   - java -cp $HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR} org.apache.hudi.hive.HiveSyncTool "$@"
   + java -cp ${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:$HUDI_HIVE_UBER_JAR org.apache.hudi.hive.HiveSyncTool "$@"
   ```
   
   I'm happy to submit a PR IF you all think this is indeed something that 
should be resolved. Thanks!





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058098918

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2058098358

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * d5f312761099a9c57394f89c9b481e58773cb17f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22720)
 
   * 7f43200dfc27f8ff499c9d0c9c375b635120f67e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23272)
 
   * f7ab315084f8534388db563a20d34b174cc63fa3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Commented] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-15 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837477#comment-17837477
 ] 

Ethan Guo commented on HUDI-7596:
-

Say module A is depended on by module B, and there are functional tests in B 
exercising code in A. I think the report for B will not include A's classes. 
Hudi relies heavily on "functional" tests in, e.g., hudi-spark-client that 
exercise code in hudi-client-common or hudi-common. So, Jacoco can underreport 
coverage for the modules being depended on.

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Danny Chen
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Commented] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-15 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837476#comment-17837476
 ] 

Ethan Guo commented on HUDI-7596:
-

https://www.baeldung.com/maven-jacoco-multi-module-project

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Danny Chen
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Assigned] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-15 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7596:
---

Assignee: Danny Chen

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Danny Chen
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-15 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7596:

Reviewers: Danny Chen

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-7596) Enable Jacoco code coverage report across multiple modules

2024-04-15 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7596:

Reviewers:   (was: Danny Chen)

> Enable Jacoco code coverage report across multiple modules
> --
>
> Key: HUDI-7596
> URL: https://issues.apache.org/jira/browse/HUDI-7596
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>  Labels: starter
> Fix For: 0.15.0, 1.0.0
>
>






[jira] [Updated] (HUDI-6699) An indexed global timeline (phase2)

2024-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6699:
-
Status: Open  (was: In Progress)

> An indexed global timeline (phase2)
> ---
>
> Key: HUDI-6699
> URL: https://issues.apache.org/jira/browse/HUDI-6699
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>






[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6787:
-
Story Points: 25  (was: 5)

> Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> --
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6787:
-
Story Points: 5  (was: 25)

> Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> --
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






Re: [PR] [HUDI-7146] [RFC-77] RFC for secondary index [hudi]

2024-04-15 Thread via GitHub


codope commented on code in PR #10814:
URL: https://github.com/apache/hudi/pull/10814#discussion_r1565589697


##
rfc/rfc-77/rfc-77.md:
##
@@ -0,0 +1,247 @@
+
+
+# RFC-77: Secondary Indexes
+
+## Proposers
+
+- @bhat-vinay
+- @codope
+
+## Approvers
+ - @vinothchandar
+ - @nsivabalan
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-7146
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+In this RFC, we propose implementing Secondary Indexes (SI), a new capability 
in Hudi's metadata table (MDT) based indexing 
+system. SI are indexes defined on user-specified columns of the table. 
Similar to record-level indexes,
+SI will improve query performance when the query predicate contains secondary 
keys. The number of files
+that a query needs to scan can be pruned down using secondary indexes.
+
+## Background
+
+Hudi supports different indexes through its MDT. These indexes help to improve 
query performance by
+pruning down the set of files that need to be scanned to build the result set 
(of the query). 
+
+One of the supported indexes in Hudi is the Record Level Index (RLI). RLI acts 
as a unique-key index and can be used to 
+locate a FileGroup of a record based on its RecordKey. A query having an EQUAL 
or IN predicate on the RecordKey will 
+have a performance boost as the RLI can accurately give a subset of FileGroups 
that contain the rows matching the 
+predicate.
+
+Many workloads have queries with predicates that are not based on RecordKey. 
Such queries cannot use RLI for data
+skipping. Traditional databases have a notion of building indexes (called 
Secondary Index or SI) on user specified 
+columns to aid such queries. This RFC proposes implementing SI in Hudi. Users 
can build SI on columns which are 
+frequently used as filtering columns (i.e columns on which query predicate is 
based on). As with any other index, 
+building and maintaining SI adds overhead on the write path. Users should 
choose wisely based 
+on their workload. Tools can be built to provide guidance on the usefulness of 
indexing a specific column, but it is 
+not in the scope of this RFC.
+
+## Design and Implementation
+This section discusses briefly the goals, design, implementation details of 
supporting SI in Hudi. At a high level,
+the design principle and goals are as follows:
+1. User specifies SI to be built on a given column of a table. A given SI can 
be built on only one column of the table
+(i.e., composite keys are not allowed). Any number of SIs can be built on a Hudi 
table. The indexes to be built are 
+specified using regular SQL statements.
+2. Metadata of a SI will be tracked through the index metadata file under 
`/.hoodie/.index` (this path can be configurable).
+3. Each SI will be a partition inside Hudi MDT. Index data will not be 
materialized with the base table's data files.
+4. Logical plan of a query will be used to efficiently filter FileGroups based 
on the query predicate and the available
+indexes.
+
+### SQL
+SI can be created using the regular `CREATE INDEX` SQL statement.
+```
+-- PROPOSED SYNTAX WITH `secondary_index` as the index type --
+CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name [USING 
secondary_index](index_column)
+-- Examples --
+CREATE INDEX idx_city on hudi_table USING secondary_index(city)
+CREATE INDEX idx_last_name on hudi_table (last_name)
+
+-- NO CHANGE IN DROP INDEX --
+DROP INDEX idx_city;
+```
+
+`index_name` - Required and validated by parser. `index_name` will be used to 
derive the name of the physical partition
+in MDT by prefixing `secondary_index_`. If the `index_name` is `idx_city`, 
then the MDT partition will be 
+`secondary_index_idx_city`
+
+The index_type will be `secondary_index`. This will be used to distinguish SI 
from other Functional Indexes.
+
+### Secondary Index Metadata
+Secondary index metadata will be managed the same way as Functional Index 
metadata. Since SI will not have any function
+to be applied on each row, the `function_name` will be NULL.
+
+### Index in Metadata Table (MDT)
+Each SI will be stored as a physical partition in the MDT. The partition name 
is derived from the `index_name` by 
+prefixing `secondary_index_`. Each entry in the SI partition will be a mapping 
of the form 
+`secondary_key -> record_key`. `secondary_key` will form the "record key" for 
the record of the SI partition. Note that
+an important design consideration here is that users may choose to build SI on 
a non-unique column of the table.
+
+#### Index Initialisation
+Initial build of the secondary index will scan all file slices (of the base 
table) to extract 
+`secondary-key -> record-key` tuple and write it into the secondary index 
partition in the metadata table. 
+This is similar to how RLI is initialised.
+
+#### Index Maintenance
+The index needs to be updated on inserts, updates and deletes to the base 
table. Considering that secondary-keys in 
+the base table could be non-unique,
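
A minimal standalone sketch of the `secondary_key -> record_key` mapping the 
RFC describes, using a plain in-memory multimap purely for illustration (the 
real index lives in MDT file groups, and the indexed column may be non-unique):

```scala
// Illustrative only: models why a non-unique secondary key maps to many record keys.
object SecondaryIndexSketch {
  // city (secondary key) -> record keys of the base table
  val index: Map[String, Set[String]] = Map(
    "chennai" -> Set("rider-A", "rider-C"),
    "sf"      -> Set("rider-B")
  )

  // Query-side pruning: an EQUALS predicate on the indexed column yields record
  // keys, which RLI (or file-group metadata) can map down to candidate files.
  def lookup(secondaryKey: String): Set[String] =
    index.getOrElse(secondaryKey, Set.empty)

  def main(args: Array[String]): Unit =
    println(lookup("chennai")) // Set(rider-A, rider-C)
}
```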

[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7582:
-
Story Points: 4

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() 
> tries to look up and dereference the index definition using this empty string. 
>  
> This part of the code should never have worked - hence it looks like the 
> functional index (based on col-stats) is not tested on the query path. Getting 
> the index partition to use on the query side seems more involved - the 
> incoming query predicate needs to be parsed to get the (column-name, 
> function-name) pair for each query predicate, and then the corresponding 
> index partition fetched by walking through the index-defs maintained in the 
> index metadata. 





Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


bhat-vinay commented on code in PR #11021:
URL: https://github.com/apache/hudi/pull/11021#discussion_r1566605569


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -349,10 +350,15 @@ case class HoodieFileIndex(spark: SparkSession,
       Option.empty
     } else if (recordKeys.nonEmpty) {
       Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
-    } else if (functionalIndex.isIndexAvailable && !queryFilters.isEmpty) {
+    } else if (functionalIndex.isIndexAvailable && queryFilters.nonEmpty && functionalIndex.extractSparkFunctionNames(queryFilters).nonEmpty) {
+      val functionToColumnNames = functionalIndex.extractSparkFunctionNames(queryFilters)
+      // Currently, only one functional index in the query is supported. HUDI-7620 for supporting multiple functions.
+      checkState(functionToColumnNames.size == 1, "Currently, only one function with functional index in the query is supported")
+      val (indexFunction, targetColumnName) = functionToColumnNames.head
+      val partitionOption = functionalIndex.getFunctionalIndexPartition(indexFunction, targetColumnName)
       val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
       val shouldReadInMemory = functionalIndex.shouldReadInMemory(this, queryReferencedColumns)
-      val indexDf = functionalIndex.loadFunctionalIndexDataFrame("", shouldReadInMemory)
+      val indexDf = functionalIndex.loadFunctionalIndexDataFrame(partitionOption.get, shouldReadInMemory)

Review Comment:
   `partitionOption` could be empty?






[jira] [Commented] (HUDI-7144) Support query for tables written as partitionBy but synced as non-partitioned

2024-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837474#comment-17837474
 ] 

Vinoth Chandar commented on HUDI-7144:
--

Things to ensure:

1. The out-of-box experience for partitioned tables should not need any 
additional write/query configs to pass in col stats fields etc.
2. What about data type support? It uses Comparable.
3. Tests around just the MT partition read/write.

> Support query for tables written as partitionBy but synced as non-partitioned
> -
>
> Key: HUDI-7144
> URL: https://issues.apache.org/jira/browse/HUDI-7144
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7023, we added support to sync any table as non-partitioned table and 
> yet be able to query via Spark with the same performance benefits of 
> partitioned table.
> This ticket extends the functionality end-to-end. If a user executes  
> `spark.write.format("hudi").options(options).partitionBy(partCol).save(basePath)`,
>  then do logical partitioning and sync as non-partitioned table to the 
> catalog. Yet be able to query efficiently.





[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
--
Reviewers: Jonathan Vexler

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables when trying to debug some 
> other issue with functional-index. It seems that the column ordering gets 
> messed up while inserting records into a hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
> if (HoodieSparkUtils.gteqSpark3_2) {
>   withTempDir { tmp =>
> val tableType = "cow"
>   val tableName = "rides"
>   val basePath = s"${tmp.getCanonicalPath}/$tableName"
>   spark.sql("set hoodie.metadata.enable=true")
>   spark.sql(
> s"""
>|create table $tableName (
>|  id int,
>|  name string,
>|  price int,
>|  ts long
>|) using hudi
>| options (
>|  primaryKey ='id',
>|  type = '$tableType',
>|  preCombineField = 'ts',
>|  hoodie.metadata.record.index.enable = 'true',
>|  hoodie.datasource.write.recordkey.field = 'id'
>| )
>| partitioned by(price)
>| location '$basePath'
>""".stripMargin)
>   spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 
> 'a1', 10, 1000)")
>   spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 
> 'a2', 100, 20)")
>   spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 
> 'a3', 1000, 20)")
>   spark.sql(s"select id, name, price, ts from $tableName").show(false)
>   }
> }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---++--++
> |id |name|price |ts  |
> +---++--++
> |3  |a3  |20|1000|
> |2  |a2  |20|100 |
> |1  |a1  |1000  |10  |
> +---++--++
>  {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.





[jira] [Updated] (HUDI-7144) Support query for tables written as partitionBy but synced as non-partitioned

2024-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7144:
-
Reviewers: Ethan Guo, Vinoth Chandar  (was: Vinoth Chandar)

> Support query for tables written as partitionBy but synced as non-partitioned
> -
>
> Key: HUDI-7144
> URL: https://issues.apache.org/jira/browse/HUDI-7144
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7023, we added support to sync any table as non-partitioned table and 
> yet be able to query via Spark with the same performance benefits of 
> partitioned table.
> This ticket extends the functionality end-to-end. If a user executes  
> `spark.write.format("hudi").options(options).partitionBy(partCol).save(basePath)`,
>  then do logical partitioning and sync as non-partitioned table to the 
> catalog. Yet be able to query efficiently.





[jira] [Updated] (HUDI-7570) Update RFC with details on API changes

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7570:
--
Status: Patch Available  (was: In Progress)

> Update RFC with details on API changes
> --
>
> Key: HUDI-7570
> URL: https://issues.apache.org/jira/browse/HUDI-7570
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Given that the secondary index can have duplicate keys, the existing 
> `HoodieMergedLogRecordScanner` is insufficient to handle duplicates because 
> it depends on `ExternalSpillableMap`, which can only hold unique keys. The RFC 
> should clarify how the merged log record scanner will change. We should not 
> be leaking any details to the merge handle.
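
A tiny standalone sketch of the limitation described in the ticket, with 
illustrative data only: a unique-key map (like `ExternalSpillableMap`) silently 
drops one of the duplicate secondary keys, while a grouped multimap keeps both.

```scala
// Illustrative only: contrasts unique-key semantics with duplicate-tolerant grouping.
object DuplicateKeySketch {
  // (secondary key, record key) pairs as a log scanner might encounter them
  val logRecords: Seq[(String, String)] = Seq(
    "city=sf" -> "rider-A",
    "city=sf" -> "rider-B",
    "city=ny" -> "rider-C"
  )

  def main(args: Array[String]): Unit = {
    // Unique-key semantics: the later "city=sf" entry silently wins.
    val uniqueKeyMap: Map[String, String] = logRecords.toMap
    println(uniqueKeyMap("city=sf")) // rider-B only; rider-A is lost

    // Duplicate-tolerant merge: keep every record key per secondary key.
    val grouped: Map[String, Seq[String]] =
      logRecords.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    println(grouped("city=sf")) // List(rider-A, rider-B)
  }
}
```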





[jira] [Updated] (HUDI-7570) Update RFC with details on API changes

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7570:
--
Reviewers: Vinoth Chandar

> Update RFC with details on API changes
> --
>
> Key: HUDI-7570
> URL: https://issues.apache.org/jira/browse/HUDI-7570
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Given that the secondary index can have duplicate keys, the existing 
> `HoodieMergedLogRecordScanner` is insufficient to handle duplicates because 
> it depends on `ExternalSpillableMap`, which can only hold unique keys. The RFC 
> should clarify how the merged log record scanner will change. We should not 
> be leaking any details to the merge handle.





[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7582:
--
Sprint: Sprint 2024-03-25

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() 
> tries to look up and dereference the index definition using this empty string. 
>  
> This part of the code should never have worked - hence it looks like the 
> functional index (based on col-stats) is not tested on the query path. Getting 
> the index partition to use on the query side seems more involved - the 
> incoming query predicate needs to be parsed to get the (column-name, 
> function-name) pair for each query predicate, and then the corresponding 
> index partition fetched by walking through the index-defs maintained in the 
> index metadata. 





[PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


danny0405 opened a new pull request, #11028:
URL: https://github.com/apache/hudi/pull/11028

   ### Change Logs
   
   There is no need to copy for most of the use cases.
   
   ### Impact
   
   no impact.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7582:
--
Status: In Progress  (was: Open)

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() 
> tries to look up and dereference the index definition using this empty string. 
>  
> This part of the code should never have worked - hence it looks like the 
> functional index (based on col-stats) is not tested on the query path. Getting 
> the index partition to use on the query side seems more involved - the 
> incoming query predicate needs to be parsed to get the (column-name, 
> function-name) pair for each query predicate, and then the corresponding 
> index partition fetched by walking through the index-defs maintained in the 
> index metadata. 





[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7582:
--
Status: Patch Available  (was: In Progress)

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() 
> tries to look up and dereference the index definition using this empty string. 
>  
> This part of the code should never have worked - hence it looks like the 
> functional index (based on col-stats) is not tested on the query path. Getting 
> the index partition to use on the query side seems more involved - the 
> incoming query predicate needs to be parsed to get the (column-name, 
> function-name) pair for each query predicate, and then the corresponding 
> index partition fetched by walking through the index-defs maintained in the 
> index metadata. 





[jira] [Assigned] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit reassigned HUDI-7582:
-

Assignee: Sagar Sumit

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -
>
> Key: HUDI-7582
> URL: https://issues.apache.org/jira/browse/HUDI-7582
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> lookupCandidateFilesInMetadataTable(...) calls 
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string 
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() 
> tries to look up and dereference the index definition using this empty 
> string. 
>  
> This part of the code could never have worked, so it looks like the 
> functional index (based on col-stats) is not tested on the query path. 
> Getting the index partition to use on the query side seems more involved: 
> the incoming query predicates need to be parsed to extract the (column-name, 
> function-name) pairs, and the corresponding index partition then fetched by 
> walking through the index definitions maintained in the index metadata. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues

2024-04-15 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
--
Status: Patch Available  (was: In Progress)

> Inserting rows into partitioned table leads to data sanity issues
> -
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 1.0.0-beta1, 0.14.1
>Reporter: Vinaykumar Bhat
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables when trying to debug another 
> issue with the functional index. It seems that the column ordering gets 
> mixed up while inserting records into a Hudi table. Hence, a subsequent 
> query returns wrong results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              |options (
>              |  primaryKey = 'id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              |)
>              |partitioned by (price)
>              |location '$basePath'
>              """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   } {code}
>  
> The query returns the following result (note how *price* and *ts* columns are 
> mixed up). 
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
> {code}
>  
> Having the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns are of incompatible datatypes, then the 
> insert fails with an error.
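
For reference, the expected (correct) output, derived from the three insert 
statements above (row order may vary), would be:

{code:java}
+---+----+-----+----+
|id |name|price|ts  |
+---+----+-----+----+
|1  |a1  |10   |1000|
|2  |a2  |100  |20  |
|3  |a3  |1000 |20  |
+---+----+-----+----+
{code}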



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


bhat-vinay commented on code in PR #11021:
URL: https://github.com/apache/hudi/pull/11021#discussion_r1566596914


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -156,7 +156,7 @@ object DataSourceReadOptions {
 
   val ENABLE_DATA_SKIPPING: ConfigProperty[Boolean] = ConfigProperty
 .key("hoodie.enable.data.skipping")
-.defaultValue(false)
+.defaultValue(true)

Review Comment:
   Please revert this change before landing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7577) Avoid MDT compaction instant time conflicts

2024-04-15 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7577.

Resolution: Fixed

Fixed via master branch: 40d4f489389083e3c6d69954361d3de4aec8186a

> Avoid MDT compaction instant time conflicts
> ---
>
> Key: HUDI-7577
> URL: https://issues.apache.org/jira/browse/HUDI-7577
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, core, metadata, table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: Compaction, pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7577] Avoid MDT compaction instant time conflicts (#10992)

2024-04-15 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 40d4f489389 [HUDI-7577] Avoid MDT compaction instant time conflicts 
(#10992)
40d4f489389 is described below

commit 40d4f489389083e3c6d69954361d3de4aec8186a
Author: Danny Chan 
AuthorDate: Tue Apr 16 09:12:19 2024 +0800

[HUDI-7577] Avoid MDT compaction instant time conflicts (#10992)
---
 .../metadata/HoodieBackedTableMetadataWriter.java  |  6 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  | 23 +++---
 .../functional/TestHoodieBackedMetadata.java   | 23 +++---
 .../table/timeline/HoodieInstantTimeGenerator.java | 13 ++--
 4 files changed, 36 insertions(+), 29 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index dea317e60b7..a541de03cb3 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -53,6 +53,7 @@ import 
org.apache.hudi.common.table.log.block.HoodieDeleteBlock;
 import 
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
 import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
@@ -1357,7 +1358,10 @@ public abstract class HoodieBackedTableMetadataWriter 
implements HoodieTableM
 // The compaction planner will manage to filter out the log files that 
finished with greater completion time.
 // see BaseHoodieCompactionPlanGenerator.generateCompactionPlan for more 
details.
 final String compactionInstantTime = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant().map(HoodieInstant::getTimestamp)
+.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant()
+// minus the pending instant time by 1 millisecond to avoid conflict 
in case when this pending instant was finally been committed
+// as a delta_commit in MDT.
+.map(instant -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(instant.getTimestamp(), 1L))
 .orElse(writeClient.createNewInstantTime(false));
 
 // we need to avoid checking compaction w/ same instant again.
diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
index 35319a6e403..736eee97e85 100644
--- 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
+++ 
b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java
@@ -559,34 +559,35 @@ public class TestJavaHoodieBackedMetadata extends 
TestHoodieMetadataBase {
 .withMaxNumDeltaCommitsBeforeCompaction(4)
 .build()).build();
 initWriteConfigAndMetatableWriter(writeConfig, true);
-doWriteOperation(testTable, "001", INSERT);
-String commitInstant = "002";
-doWriteOperation(testTable, commitInstant, INSERT);
+doWriteOperation(testTable, metaClient.createNewInstantTime(), INSERT);
+doWriteOperation(testTable, metaClient.createNewInstantTime(), INSERT);
 
-// test multi-writer scenario. lets add 1,2,3,4 where 1,2,4 succeeded, but 
3 is still inflight. so latest delta commit in MDT is 4, while 3 is still 
pending
+// test multi-writer scenario. let's add 1,2,3,4 where 1,2,4 succeeded, 
but 3 is still inflight. so latest delta commit in MDT is 4, while 3 is still 
pending
 // in DT and not seen by MDT yet. compaction should not trigger until 3 
goes to completion.
 
 // create an inflight commit for 3
-HoodieCommitMetadata inflightCommitMeta = 
testTable.doWriteOperation("003", UPSERT, emptyList(),
+String inflightInstant = metaClient.createNewInstantTime();
+HoodieCommitMetadata inflightCommitMeta = 
testTable.doWriteOperation(inflightInstant, UPSERT, emptyList(),
 asList("p1", "p2"), 2, false, true);
-doWriteOperation(testTable, "004");
+doWriteOperation(testTable, metaClient.createNewInstantTime());
 HoodieTableMetadata tableMetadata = metadata(writeConfig, context);
 // verify th

Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


danny0405 merged PR #10992:
URL: https://github.com/apache/hudi/pull/10992


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on code in PR #10992:
URL: https://github.com/apache/hudi/pull/10992#discussion_r1566592380


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -106,17 +106,18 @@ public static String instantTimePlusMillis(String 
timestamp, long milliseconds)
   }
 
   public static String instantTimeMinusMillis(String timestamp, long 
milliseconds) {
+final String timestampInMillis = fixInstantTimeCompatibility(timestamp);
 try {
-  String timestampInMillis = fixInstantTimeCompatibility(timestamp);
-  // To work with tests, that generate arbitrary timestamps, we need to 
pad the timestamp with 0s.
-  if (timestampInMillis.length() < MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH) 
{
-return String.format("%0" + timestampInMillis.length() + "d", 0);
-  }
   LocalDateTime dt = LocalDateTime.parse(timestampInMillis, 
MILLIS_INSTANT_TIME_FORMATTER);
   ZoneId zoneId = HoodieTimelineTimeZone.UTC.equals(commitTimeZone) ? 
ZoneId.of("UTC") : ZoneId.systemDefault();
   return 
MILLIS_INSTANT_TIME_FORMATTER.format(dt.atZone(zoneId).toInstant().minusMillis(milliseconds).atZone(zoneId).toLocalDateTime());
 } catch (DateTimeParseException e) {
-  throw new HoodieException(e);
+  // To work with tests, that generate arbitrary timestamps, we need to 
pad the timestamp with 0s.
+  if (isValidInstantTime(timestampInMillis)) {
+return String.format("%0" + timestampInMillis.length() + "d", 
Long.parseLong(timestampInMillis) - milliseconds);
+  } else {
+throw new HoodieException(e);
+  }

Review Comment:
   > do we allow arbitrary timestamps any more
   
   We allow it now, because the start instant time does not really affect 
correctness; only the compaction instant time and the completion time do.
   
   It would be better to use real timestamps in tests in the future, to keep 
the tests more in line with a real production environment.
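
   To make the fallback concrete, here is a small self-contained illustration 
of the branch quoted above: when the timestamp is not a parseable date (e.g. 
the arbitrary "001"-style instants that tests generate), subtract numerically 
and re-pad with zeros to the original width. This mirrors the quoted logic but 
is not the Hudi class itself:
   
   ```java
   static String minusMillisPadded(String timestamp, long millis) {
     // Subtract numerically, then left-pad with zeros to the original width.
     return String.format("%0" + timestamp.length() + "d",
         Long.parseLong(timestamp) - millis);
   }
   
   // minusMillisPadded("004", 1L)  -> "003"
   // minusMillisPadded("0010", 1L) -> "0009"
   ```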



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on code in PR #10992:
URL: https://github.com/apache/hudi/pull/10992#discussion_r1566591565


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1360,7 +1361,10 @@ protected void compactIfNecessary(BaseHoodieWriteClient 
writeClient, String late
 // The compaction planner will manage to filter out the log files that 
finished with greater completion time.
 // see BaseHoodieCompactionPlanGenerator.generateCompactionPlan for more 
details.
 final String compactionInstantTime = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant().map(HoodieInstant::getTimestamp)
+.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant()
+// minus the pending instant time by 1 millisecond to avoid conflict 
in case when this pending instant was finally been committed
+// as a delta_commit in MDT.
+.map(instant -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(instant.getTimestamp(), 1L))

Review Comment:
   > Compaction with same %s time is already present in the timeline.).
   
   Not really; the earliest pending instant on the DT may not have a committed 
delta_commit on the MDT timeline.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11027:
URL: https://github.com/apache/hudi/pull/11027#issuecomment-2058057742

   
   ## CI report:
   
   * b96388ad837c124fb63a8655f295fadebc37319f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23273)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2058057634

   
   ## CI report:
   
   * d8dda49ff97feca5172346047aacb007746568ae Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23214)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Task serialization failed: java.lang.NoSuchMethodError: void org.apache.hudi.common.util.HoodieCommonKryoRegistrar.registerClasses(com.esotericsoftware.kryo.Kryo) [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on issue #11026:
URL: https://github.com/apache/hudi/issues/11026#issuecomment-2058058270

   Looks like a jar conflict, do you have multiple Hudi bundle jars on the 
classpath?
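
   If it helps to confirm, one generic check (plain Java, no Hudi APIs) is to 
list every classpath entry that provides the class whose method is reported 
missing; more than one line printed means conflicting bundles:
   
   ```java
   import java.io.IOException;
   import java.net.URL;
   import java.util.Collections;
   
   public final class DupClassCheck {
     public static void main(String[] args) throws IOException {
       // The class from the NoSuchMethodError above.
       String resource =
           "org/apache/hudi/common/util/HoodieCommonKryoRegistrar.class";
       for (URL url : Collections.list(
           DupClassCheck.class.getClassLoader().getResources(resource))) {
         System.out.println(url); // each line is a jar providing this class
       }
     }
   }
   ```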


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2058057248

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * d5f312761099a9c57394f89c9b481e58773cb17f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22720)
 
   * 7f43200dfc27f8ff499c9d0c9c375b635120f67e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23272)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10992:
URL: https://github.com/apache/hudi/pull/10992#issuecomment-2058052392

   
   ## CI report:
   
   * d8dda49ff97feca5172346047aacb007746568ae UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7577] Avoid MDT compaction instant time conflicts [hudi]

2024-04-15 Thread via GitHub


yihua commented on code in PR #10992:
URL: https://github.com/apache/hudi/pull/10992#discussion_r1566585767


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -1360,7 +1361,10 @@ protected void compactIfNecessary(BaseHoodieWriteClient 
writeClient, String late
 // The compaction planner will manage to filter out the log files that 
finished with greater completion time.
 // see BaseHoodieCompactionPlanGenerator.generateCompactionPlan for more 
details.
 final String compactionInstantTime = 
dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-
.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant().map(HoodieInstant::getTimestamp)
+.findInstantsBeforeOrEquals(latestDeltacommitTime).firstInstant()
+// minus the pending instant time by 1 millisecond to avoid conflict 
in case when this pending instant was finally been committed
+// as a delta_commit in MDT.
+.map(instant -> 
HoodieInstantTimeGenerator.instantTimeMinusMillis(instant.getTimestamp(), 1L))

Review Comment:
   Based on the logic, previously the compaction would never be scheduled, 
given it picks the instant time that already exists (entering the if branch 
logging `Compaction with same %s time is already present in the timeline.`).



##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstantTimeGenerator.java:
##
@@ -106,17 +106,18 @@ public static String instantTimePlusMillis(String 
timestamp, long milliseconds)
   }
 
   public static String instantTimeMinusMillis(String timestamp, long 
milliseconds) {
+final String timestampInMillis = fixInstantTimeCompatibility(timestamp);
 try {
-  String timestampInMillis = fixInstantTimeCompatibility(timestamp);
-  // To work with tests, that generate arbitrary timestamps, we need to 
pad the timestamp with 0s.
-  if (timestampInMillis.length() < MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH) 
{
-return String.format("%0" + timestampInMillis.length() + "d", 0);
-  }
   LocalDateTime dt = LocalDateTime.parse(timestampInMillis, 
MILLIS_INSTANT_TIME_FORMATTER);
   ZoneId zoneId = HoodieTimelineTimeZone.UTC.equals(commitTimeZone) ? 
ZoneId.of("UTC") : ZoneId.systemDefault();
   return 
MILLIS_INSTANT_TIME_FORMATTER.format(dt.atZone(zoneId).toInstant().minusMillis(milliseconds).atZone(zoneId).toLocalDateTime());
 } catch (DateTimeParseException e) {
-  throw new HoodieException(e);
+  // To work with tests, that generate arbitrary timestamps, we need to 
pad the timestamp with 0s.
+  if (isValidInstantTime(timestampInMillis)) {
+return String.format("%0" + timestampInMillis.length() + "d", 
Long.parseLong(timestampInMillis) - milliseconds);
+  } else {
+throw new HoodieException(e);
+  }

Review Comment:
   Given the new instant time meaning and completion time, do we allow 
arbitrary timestamps any more (e.g., "1", "2", "13")?  Should we 
create a follow-up ticket to fix all relevant tests?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6497] Replace FileSystem, Path, and FileStatus usage in `hudi-common` module [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10591:
URL: https://github.com/apache/hudi/pull/10591#issuecomment-2058051993

   
   ## CI report:
   
   * 8207558e8c8714386cf2f71929d6fb08db10617b UNKNOWN
   * d5f312761099a9c57394f89c9b481e58773cb17f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22720)
 
   * 7f43200dfc27f8ff499c9d0c9c375b635120f67e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11027:
URL: https://github.com/apache/hudi/pull/11027#issuecomment-2058052484

   
   ## CI report:
   
   * b96388ad837c124fb63a8655f295fadebc37319f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [MINOR] Remove redundant lines in StreamSync and TestStreamSyncUnitTests [hudi]

2024-04-15 Thread via GitHub


yihua opened a new pull request, #11027:
URL: https://github.com/apache/hudi/pull/11027

   ### Change Logs
   
   As above.
   
   ### Impact
   
   Cleaner code.
   
   ### Risk level
   
   none
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7608] Fix Flink table creation configuration not taking effect when writing… [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on code in PR #11005:
URL: https://github.com/apache/hudi/pull/11005#discussion_r1566579977


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieOptionConfig.scala:
##
@@ -43,6 +43,11 @@ object HoodieOptionConfig {
*/
   val SQL_VALUE_TABLE_TYPE_MOR = "mor"
 
+  /**
+   * The short name for the value of index type.
+   */
+  val SQL_VALUE_INDEX_TYPE = "index.type"

Review Comment:
   `index.type` is a shortcut name used only by Flink for now.
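
   For context, a hedged illustration of how this shortcut is used on the 
Flink side (option names per the Hudi Flink connector; the table name, schema, 
and path below are made up):
   
   ```java
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;
   
   public final class FlinkIndexTypeExample {
     public static void main(String[] args) {
       TableEnvironment tableEnv =
           TableEnvironment.create(EnvironmentSettings.inStreamingMode());
       // `index.type` is passed directly in the Flink SQL table options.
       tableEnv.executeSql(
           "CREATE TABLE t1 (id INT PRIMARY KEY NOT ENFORCED, name STRING) "
               + "WITH ("
               + "  'connector' = 'hudi',"
               + "  'path' = 'file:///tmp/t1',"
               + "  'table.type' = 'MERGE_ON_READ',"
               + "  'index.type' = 'BUCKET')");
     }
   }
   ```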



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7609] Support array field type whose element type can be nullable [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on code in PR #11006:
URL: https://github.com/apache/hudi/pull/11006#discussion_r1566577832


##
hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/Parquet2SparkSchemaUtils.java:
##
@@ -140,7 +141,7 @@ private static String convertGroupField(GroupType field) {
 ValidationUtils.checkArgument(field.getFieldCount() == 1, "Illegal 
List type: " + field);
 Type repeatedType = field.getType(0);
 if (isElementType(repeatedType, field.getName())) {
-  return arrayType(repeatedType, false);
+  return arrayType(repeatedType, true);

Review Comment:
   So you might need to validate the option `spark.sql.sources.schema.numParts` 
set up within Hive, I guess. And since this option only affects the Spark 
engine, should we instead fix the table schema stored within 
`hoodie.properties`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: [I] [SUPPORT] Metadata table not cleaned / compacted, log files growing rapidly [hudi]

2024-04-15 Thread via GitHub


danny0405 commented on issue #8567:
URL: https://github.com/apache/hudi/issues/8567#issuecomment-2058029478

   > 2a0969c9972ef746d377dbddd278ef13bf3d299d
   
   For a MOR table, it should be fine as long as it uses upsert semantics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2057958997

   
   ## CI report:
   
   * 94171e2cb1dd8066176589376d1af6c49f676b9c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23271)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan [hudi]

2024-04-15 Thread via GitHub


nsivabalan commented on code in PR #10965:
URL: https://github.com/apache/hudi/pull/10965#discussion_r1566489716


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,36 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());

Review Comment:
   1. Yeah. After reading Kishan's response, I feel we should fail the 
execution if compaction is currently being attempted by another concurrent 
writer.  
   
   2. Even on REQUESTED: since we are taking a lock and checking the heartbeat 
client, wouldn't that ensure only one writer can proceed and the other writer 
will fail? 



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1138,34 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other concurrent
+  // compact job will abort if they attempt to execute compact before 
heartbeat expires
+  // Note that as long as all jobs for this table use this API for 
compact, then this alone should prevent
+  // compact rollbacks from running concurrently to compact commits.
+  txnManager.beginTransaction(instantToCompactOption, 
txnManager.getLastCompletedTransactionOwner());
+  try {
+if (!this.heartbeatClient.isHeartbeatExpired(compactionInstantTime)) {
+  throw new HoodieLockException("Cannot compact instant " + 
compactionInstantTime + " due to heartbeat by existing job");
+}
+  } catch (IOException e) {
+throw new HoodieHeartbeatException("Error accessing heartbeat of 
instant to compact " + compactionInstantTime, e);
+  }
+  this.heartbeatClient.start(compactionInstantTime);
+} finally {
+  txnManager.endTransaction(txnManager.getCurrentTransactionOwner());
+}
 preWrite(compactionInstantTime, WriteOperationType.COMPACT, 
table.getMetaClient());
-return tableServiceClient.compact(compactionInstantTime, shouldComplete);
+HoodieWriteMetadata compactMetadata = 
tableServiceClient.compact(compactionInstantTime, shouldComplete);
+this.heartbeatClient.stop(compactionInstantTime, true);

Review Comment:
   Yeah, I see your point. Probably every caller, when calling stop, should 
remove the entry from the map. 
   It looks like we never remove any entry from instantToHeartbeatMap on 
master (except when shutting down the entire HeartbeatClient). 
   @n3nash: any pointers in this regard? 
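
   A hypothetical sketch of that cleanup (field and method names are assumed 
and do not mirror Hudi's HoodieHeartbeatClient exactly): remove the 
per-instant entry when a heartbeat is stopped, so the map does not grow for 
the lifetime of the client.
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ScheduledFuture;
   
   final class HeartbeatRegistry {
     private final Map<String, ScheduledFuture<?>> instantToHeartbeatMap =
         new ConcurrentHashMap<>();
   
     void stop(String instantTime) {
       // Remove first so the entry is gone even if cancel() fails.
       ScheduledFuture<?> heartbeat = instantToHeartbeatMap.remove(instantTime);
       if (heartbeat != null) {
         heartbeat.cancel(false); // no further ticks; an in-flight tick may finish
       }
     }
   }
   ```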
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1135,8 +1137,34 @@ protected void 
completeLogCompaction(HoodieCommitMetadata metadata, HoodieTable
*/
   protected HoodieWriteMetadata compact(String compactionInstantTime, 
boolean shouldComplete) {
 HoodieTable table = createTable(config, context.getHadoopConf().get());
+Option instantToCompactOption = 
Option.fromJavaOptional(table.getActiveTimeline()
+.filterCompletedAndCompactionInstants()
+.getInstants()
+.stream()
+.filter(instant -> 
HoodieActiveTimeline.EQUALS.test(instant.getTimestamp(), compactionInstantTime))
+.findFirst());
+try {
+  // Transaction serves to ensure only one compact job for this instant 
will start heartbeat, and any other conc

Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2057876670

   
   ## CI report:
   
   * aed811322f7c2a2fb539d293fc93b5054d550835 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23268)
 
   * 94171e2cb1dd8066176589376d1af6c49f676b9c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23271)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10991:
URL: https://github.com/apache/hudi/pull/10991#issuecomment-2057814505

   
   ## CI report:
   
   * 9b5a2a5f69fa40f9dbd6e10d0c1c3fe9457b71da Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23269)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2057814333

   
   ## CI report:
   
   * 966e8c85f2afb0ffaf00e12d02eb41b41c68e0bc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23228)
 
   * aed811322f7c2a2fb539d293fc93b5054d550835 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23268)
 
   * 94171e2cb1dd8066176589376d1af6c49f676b9c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7566] Add schema evolution to spark file readers (#10956)

2024-04-15 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new c71ac326b79 [HUDI-7566] Add schema evolution to spark file readers 
(#10956)
c71ac326b79 is described below

commit c71ac326b791fafe30ca0e46b64d85ddedba396a
Author: Jon Vexler 
AuthorDate: Mon Apr 15 16:39:58 2024 -0400

[HUDI-7566] Add schema evolution to spark file readers (#10956)

* add spark 3.3 reader

* add spark3.4

* add spark 3.5

* add spark 3.2

* add spark 3.1

* add spark 3.0

* add spark 2.4

* spark 3.3 use properties class

* spark 3.2 add props class

* spark 3.4 add properties

* add spark 3.5 properties

* add properties spark 3.1

* add props spark 3.0

* add properties spark 2.4

* fix 3.0

* refactor to get rid of properties, spark 3.1

* remove props spark 3.0

* use class model for spark 3.3

* remove props spark 3.3

* remove props spark 3.4

* remove props spark 3.5

* remove props spark 2.4

* remove change

* remove bad import

* add spark 3.3

* add spark 3.4

* add spark 3.5

* add spark 3.2

* add spark 3.1

* add spark 3.0

* add spark 2.4

* fix 2.4

* create a copy of the conf when reading

* make conf copy during read

* add test

* allow vectorized read and comment better

* address review comments 3.5

* rename spark 3.4

* rename for spark3.3

* rename for spark 3.2

* rename spark 3.1

* rename spark 30

* rename for spark 2

* remove empty line

* address hidden review comments

* add missing import

* address comments and add changes to legacy 3.5

* spark 3.4 update legacy

* make changes to spark 3.3 and restore legacy for 3.4 and 3.5

* update spark 3.2

* update spark 3.1

* update spark 3.0

-

Co-authored-by: Jonathan Vexler <=>
---
 .../datasources/parquet/Spark24ParquetReader.scala |  64 +--
 .../Spark3ParquetSchemaEvolutionUtils.scala| 196 +
 .../datasources/parquet/Spark30ParquetReader.scala |  26 ++-
 .../Spark30ParquetSchemaEvolutionUtils.scala   |  54 ++
 .../datasources/parquet/Spark31ParquetReader.scala |  49 --
 .../Spark31ParquetSchemaEvolutionUtils.scala   |  57 ++
 .../datasources/parquet/Spark32ParquetReader.scala | 112 +---
 .../Spark32PlusParquetSchemaEvolutionUtils.scala   |  64 +++
 .../datasources/parquet/Spark33ParquetReader.scala |  18 +-
 .../datasources/parquet/Spark34ParquetReader.scala |  10 +-
 .../datasources/parquet/Spark35ParquetReader.scala |  19 +-
 11 files changed, 590 insertions(+), 79 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24ParquetReader.scala
 
b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24ParquetReader.scala
index 7fa30a36222..42808f337b7 100644
--- 
a/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24ParquetReader.scala
+++ 
b/hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24ParquetReader.scala
@@ -30,12 +30,12 @@ import org.apache.parquet.hadoop.{ParquetFileReader, 
ParquetInputFormat, Parquet
 import org.apache.spark.TaskContext
 import org.apache.spark.sql.catalyst.InternalRow
 import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
-import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.{Cast, JoinedRow, UnsafeRow}
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.datasources.{PartitionedFile, 
RecordReaderIterator}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources.Filter
-import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.types.{StructField, StructType}
 
 import java.net.URI
 
@@ -131,8 +131,17 @@ class Spark24ParquetReader(enableVectorizedReader: Boolean,
   }
 
 val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 
0), 0)
+
+// Clone new conf
+val hadoopAttemptConf = new Configuration(sharedConf)
+val (implicitTypeChangeInfos, sparkRequestSchema) = 
HoodieParquetFileFormatHelper.buildImplicitSchemaChangeInfo(hadoopAttemptConf, 
footerFileMetaData, requiredSchema)
+
+if (!implicitTypeChangeInfos.isEmpty) {
+  hadoopAttemptConf.set(ParquetReadSupport.SPARK_ROW_REQUEST

Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex merged PR #10956:
URL: https://github.com/apache/hudi/pull/10956


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057767464

   Screenshot: Azure CI passing 
(https://github.com/apache/hudi/assets/26940621/fba674a1-82fc-4b21-ab90-40623835d9f0)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [SUPPORT] Task serialization failed: java.lang.NoSuchMethodError: void org.apache.hudi.common.util.HoodieCommonKryoRegistrar.registerClasses(com.esotericsoftware.kryo.Kryo) [hudi]

2024-04-15 Thread via GitHub


vbogretsov opened a new issue, #11026:
URL: https://github.com/apache/hudi/issues/11026

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   I'm getting a dependency error on the Spark executor when running the 
`HoodieMultiTableStreamer` via the Spark Operator in Kubernetes:
   
   ```
   24/04/15 19:22:20 ERROR HoodieMultiTableStreamer: error while running 
MultiTableDeltaStreamer for table: my_table1
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 
serialization failed: java.lang.NoSuchMethodError: 'void 
org.apache.hudi.common.util.HoodieCommonKryoRegistrar.registerClasses(com.esotericsoftware.kryo.Kryo)'
   java.lang.NoSuchMethodError: 'void 
org.apache.hudi.common.util.HoodieCommonKryoRegistrar.registerClasses(com.esotericsoftware.kryo.Kryo)'
   ...
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. `kubectl apply -f config.yaml` (provided below)
   2. `kubectl -n dp2 logs -l app=hudidemo -f` (to get the logs)
   
   **Expected behavior**
   
   The mentioned dependency error does not appear in the logs and does not 
cause the Hudi Streamer to fail.
   
   **Environment Description**
   
   * Hudi version : 0.14.1
   
   * Spark version : 3.4.0
   
   * Hive version : I use AWSGlueSyncTool from AWS SDK
   
   * Hadoop version : 3.3.4
   
   * Spark Operator version: v1beta2-1.3.7-3.1.1
   
   * Kubernetes version: 1.29 AWS EKS
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * AWS SDK version: 1.12.682
   
   * AWS SDK 2 version: 2.25.13
   
   * Running on Docker? (yes/no) : yes
   
   
   **Additional context**
   
   I used the following Dockerfile to build the image `myimage:1.0.0`:
   
   ```
   ARG BASE=apache/spark:3.4.0
   ARG MVNROOT=https://maven-central-eu.storage-download.googleapis.com/maven2
   
   FROM alpine:3.19 as aws-jars
   ARG MVNROOT
   ARG AWSSDK1=1.12.682
   ARG AWSSDK2=2.25.13
   WORKDIR /jars
   RUN wget 
${MVNROOT}/com/amazonaws/aws-java-sdk-bundle/${AWSSDK1}/aws-java-sdk-bundle-${AWSSDK1}.jar
   RUN wget 
${MVNROOT}/software/amazon/awssdk/bundle/${AWSSDK2}/bundle-${AWSSDK2}.jar
   
   FROM alpine:3.19 as hadoop-jars
   ARG MVNROOT
   ARG HADOOP=3.3.4
   WORKDIR /jars
   RUN wget 
${MVNROOT}/org/apache/hadoop/hadoop-common/${HADOOP}/hadoop-common-${HADOOP}.jar
   RUN wget 
${MVNROOT}/org/apache/hadoop/hadoop-aws/${HADOOP}/hadoop-aws-${HADOOP}.jar
   RUN wget 
${MVNROOT}/org/apache/hadoop/hadoop-mapreduce-client-core/${HADOOP}/hadoop-mapreduce-client-core-${HADOOP}.jar
   
   FROM alpine:3.19 as hudi-jars
   ARG MVNROOT
   ARG SCALA=2.12
   ARG HUDI=0.14.1
   WORKDIR /jars
   RUN wget 
${MVNROOT}/org/apache/hudi/hudi-spark3.4-bundle_${SCALA}/${HUDI}/hudi-spark3.4-bundle_${SCALA}-${HUDI}.jar
   RUN wget ${MVNROOT}/org/apache/hudi/hudi-aws/${HUDI}/hudi-aws-${HUDI}.jar
   RUN wget 
${MVNROOT}/org/apache/hudi/hudi-sync-common/${HUDI}/hudi-sync-common-${HUDI}.jar
   RUN wget 
${MVNROOT}/org/apache/hudi/hudi-hive-sync-bundle/${HUDI}/hudi-hive-sync-bundle-${HUDI}.jar
   RUN wget 
${MVNROOT}/org/apache/hudi/hudi-utilities-bundle_${SCALA}/${HUDI}/hudi-utilities-bundle_${SCALA}-${HUDI}.jar
   RUN wget 
${MVNROOT}/org/apache/hudi/hudi-hadoop-mr-bundle/${HUDI}/hudi-hadoop-mr-bundle-${HUDI}.jar
   
   FROM ${BASE} as final
   COPY --from=aws-jars /jars /opt/spark/jars
   COPY --from=hadoop-jars /jars /opt/spark/jars
   COPY --from=hudi-jars /jars /opt/spark/jars
   ENV HOME=/opt/spark
   ENV PATH=/opt/spark/bin:$PATH
   ENV HUDI_CONF_DIR=/etc/hudi
   RUN mkdir -p /opt/spark/tmp
   ```
   
   I can confirm this image works locally when executed in Docker Compose with 
exactly the same command-line arguments.
   
   My Spark Operator configuration is the following:
   
   ```
   apiVersion: v1
   kind: ConfigMap
   metadata:
 name: etc-hudi
 namespace: dp2
   data:
 hudi-defaults.conf: |
   hoodie.upsert.shuffle.parallelism=8
   hoodie.insert.shuffle.parallelism=8
   hoodie.delete.shuffle.parallelism=8
   hoodie.bulkinsert.shuffle.parallelism=8
 base.properties: |
   hoodie.parquet.small.file.limit=16777216
   hoodie.index.type=GLOBAL_BLOOM
   hoodie.bloom.index.update.partition.path=true
   hoodie.datasource.write.hive_style_partitioning=false
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.database=hudidemo
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.hive_sync.use_jdbc=false
   hoodie.datasource.hive_sync.mode=hms
   
hoodie.streamer.ingestion.tablesToBeIngested=myapp.my_table1,myapp.my_table2
   
hoodie.streamer.ingestion.m

Re: [PR] [HUDI-7618] Add ability to ignore checkpoints in delta streamer [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11018:
URL: https://github.com/apache/hudi/pull/11018#issuecomment-2057697998

   
   ## CI report:
   
   * 755ddfdc5d0a02ac1cf1c35fbf5ccd21e1025a31 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23266)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2057684920

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   * 24d5e7f082788b257a42598df5f1d2378e32b041 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


yihua commented on code in PR #10956:
URL: https://github.com/apache/hudi/pull/10956#discussion_r1563556374


##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/execution/datasources/Spark3ParquetSchemaEvolutionUtils.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.hadoop.metadata.FileMetaData
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, 
UnsafeProjection}
+import 
org.apache.spark.sql.execution.datasources.Spark3ParquetSchemaEvolutionUtils.pruneInternalSchema
+import 
org.apache.spark.sql.execution.datasources.parquet.{HoodieParquetFileFormatHelper,
 ParquetReadSupport}
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types.{AtomicType, DataType, StructField, 
StructType}
+
+import scala.collection.convert.ImplicitConversions.`collection 
AsScalaIterable`
+
+abstract class Spark3ParquetSchemaEvolutionUtils(sharedConf: Configuration,

Review Comment:
   Thoughts for follow-ups in separate PRs.  I see that the 
schema-evolution-related logic is invoked per reader/file, but part of the 
logic is based on table-level information, e.g., the internal schema of the 
table.  Is it possible to pass such info in to the reader instead of deriving 
it per file group?
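
   As one possible shape for that follow-up, a generic sketch (placeholder 
types, not Hudi's) of resolving table-level state once per table and sharing 
it across per-file readers:
   
   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.function.Function;
   
   // Cache table-level state (e.g. the table's internal schema) keyed by base
   // path, so per-file readers reuse it instead of re-deriving it per file
   // group.
   final class TableLevelCache<S> {
     private final Map<String, S> byBasePath = new ConcurrentHashMap<>();
   
     S getOrLoad(String tableBasePath, Function<String, S> loader) {
       return byBasePath.computeIfAbsent(tableBasePath, loader);
     }
   }
   ```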



##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/execution/datasources/Spark3ParquetSchemaEvolutionUtils.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.hadoop.metadata.FileMetaData
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, 
UnsafeProjection}
+import 
org.apache.spark.sql.execution.datasources.Spark3ParquetSchemaEvolutionUtils.pruneInternalSchema
+import 
org.apache.spark.sql.execution.datasources.parquet.{HoodieParquetFileFormatHelper,
 ParquetReadSupport}
+import org.apache.spark.sql.sources._

Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10991:
URL: https://github.com/apache/hudi/pull/10991#issuecomment-2057613466

   
   ## CI report:
   
   * 2af03c004aef66248dae6283e9c2f1e63e062e75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23229)
 
   * 9b5a2a5f69fa40f9dbd6e10d0c1c3fe9457b71da Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23269)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2057613346

   
   ## CI report:
   
   * 966e8c85f2afb0ffaf00e12d02eb41b41c68e0bc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23228)
 
   * aed811322f7c2a2fb539d293fc93b5054d550835 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23268)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057613284

   
   ## CI report:
   
   * 8943bb4eaf741096203bed688905977d4bf59160 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23265)
 
   * 4c3242159414786c927f13b83013b045c517ff65 UNKNOWN
   * eb58a1a3af3bda46cd0db910ac39e37efd744cdd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23267)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2057600734

   
   ## CI report:
   
   * 966e8c85f2afb0ffaf00e12d02eb41b41c68e0bc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23228)
 
   * aed811322f7c2a2fb539d293fc93b5054d550835 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7269] Fallback to key based merge if positions are missing from log block [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10991:
URL: https://github.com/apache/hudi/pull/10991#issuecomment-2057600936

   
   ## CI report:
   
   * 2af03c004aef66248dae6283e9c2f1e63e062e75 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23229)
 
   * 9b5a2a5f69fa40f9dbd6e10d0c1c3fe9457b71da UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057600621

   
   ## CI report:
   
   * 8943bb4eaf741096203bed688905977d4bf59160 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23265)
 
   * 4c3242159414786c927f13b83013b045c517ff65 UNKNOWN
   * eb58a1a3af3bda46cd0db910ac39e37efd744cdd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057588207

   
   ## CI report:
   
   * 8943bb4eaf741096203bed688905977d4bf59160 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23265)
 
   * 4c3242159414786c927f13b83013b045c517ff65 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex commented on code in PR #10956:
URL: https://github.com/apache/hudi/pull/10956#discussion_r1566258283


##
hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24ParquetReader.scala:
##
@@ -156,30 +177,51 @@ class Spark24ParquetReader(enableVectorizedReader: Boolean,
   iter.asInstanceOf[Iterator[InternalRow]]
 } else {
   // ParquetRecordReader returns UnsafeRow
+  val readSupport = new ParquetReadSupport(convertTz)
   val reader = if (pushed.isDefined && enableRecordFilter) {
 val parquetFilter = FilterCompat.get(pushed.get, null)
-new ParquetRecordReader[UnsafeRow](new ParquetReadSupport(convertTz), parquetFilter)
+new ParquetRecordReader[UnsafeRow](readSupport, parquetFilter)
   } else {
-new ParquetRecordReader[UnsafeRow](new ParquetReadSupport(convertTz))
+new ParquetRecordReader[UnsafeRow](readSupport)
   }
   val iter = new RecordReaderIterator(reader)
   // SPARK-23457 Register a task completion lister before `initialization`.
   taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close()))
   reader.initialize(split, hadoopAttemptContext)
 
   val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes
-  val joinedRow = new JoinedRow()
-  val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
+  val unsafeProjection = if (implicitTypeChangeInfos.isEmpty) {
+GenerateUnsafeProjection.generate(fullSchema, fullSchema)
+  } else {
+val newFullSchema = new StructType(requiredSchema.fields.zipWithIndex.map { case (f, i) =>
+  if (implicitTypeChangeInfos.containsKey(i)) {
+StructField(f.name, implicitTypeChangeInfos.get(i).getRight, f.nullable, f.metadata)
+  } else f
+}).toAttributes ++ partitionSchema.toAttributes
+val castSchema = newFullSchema.zipWithIndex.map { case (attr, i) =>
+  if (implicitTypeChangeInfos.containsKey(i)) {
+val srcType = implicitTypeChangeInfos.get(i).getRight
+val dstType = implicitTypeChangeInfos.get(i).getLeft
+val needTimeZone = Cast.needsTimeZone(srcType, dstType)
+Cast(attr, dstType, if (needTimeZone) timeZoneId else None)
+  } else attr
+}
+GenerateUnsafeProjection.generate(castSchema, newFullSchema)
+  }

Review Comment:
   Spark 2 schema evolution diverges, so we just do what the legacy format 
does. We can follow up and discuss with the original authors of schema-on-read.
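
   To make the snippet above concrete, here is a minimal, dependency-free 
Scala sketch of the cast-on-read idea, with hypothetical names (typeChanges 
stands in for implicitTypeChangeInfos; illustration only, not the PR's code): 
columns whose on-disk type differs from the query type get a per-column 
conversion, everything else passes through unchanged.

    object CastProjectionSketch {
      // Column index -> conversion from the on-disk value to the query type.
      // Stand-in for implicitTypeChangeInfos; illustration only.
      val typeChanges: Map[Int, Any => Any] = Map(
        1 -> ((v: Any) => v.asInstanceOf[Int].toLong) // file wrote Int, query reads Long
      )

      // Apply the registered conversion where one exists, pass through otherwise.
      def project(row: Seq[Any]): Seq[Any] =
        row.zipWithIndex.map { case (v, i) => typeChanges.get(i).fold(v)(f => f(v)) }

      def main(args: Array[String]): Unit =
        println(project(Seq("key-1", 42, "2024-04-15"))) // column 1 comes back as a Long
    }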



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex commented on code in PR #10956:
URL: https://github.com/apache/hudi/pull/10956#discussion_r1566257427


##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/execution/datasources/Spark3ParquetSchemaEvolutionUtils.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.hadoop.metadata.FileMetaData
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, UnsafeProjection}
+import org.apache.spark.sql.execution.datasources.Spark3ParquetSchemaEvolutionUtils.pruneInternalSchema
+import org.apache.spark.sql.execution.datasources.parquet.{HoodieParquetFileFormatHelper, ParquetReadSupport}
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types.{AtomicType, DataType, StructField, StructType}
+
+import scala.collection.convert.ImplicitConversions.`collection AsScalaIterable`
+
+abstract class Spark3ParquetSchemaEvolutionUtils(sharedConf: Configuration,
+ filePath: Path,
+ requiredSchema: StructType,
+ partitionSchema: StructType) {
+  // Fetch internal schema
+  private lazy val internalSchemaStr: String = sharedConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA)
+
+  private lazy val querySchemaOption: util.Option[InternalSchema] = pruneInternalSchema(internalSchemaStr, requiredSchema)
+
+  var shouldUseInternalSchema: Boolean = !isNullOrEmpty(internalSchemaStr) && querySchemaOption.isPresent
+
+  private lazy val tablePath: String = sharedConf.get(SparkInternalSchemaConverter.HOODIE_TABLE_PATH)
+  private lazy val fileSchema: InternalSchema = if (shouldUseInternalSchema) {
+val commitInstantTime = FSUtils.getCommitTime(filePath.getName).toLong;
+val validCommits = sharedConf.get(SparkInternalSchemaConverter.HOODIE_VALID_COMMITS_LIST)
+InternalSchemaCache.getInternalSchemaByVersionId(commitInstantTime, tablePath, sharedConf, if (validCommits == null) "" else validCommits)
+  } else {
+null
+  }
+
+  def rebuildFilterFromParquet(filter: Filter): Filter = {
+rebuildFilterFromParquetHelper(filter, fileSchema, querySchemaOption.orElse(null))
+  }
+
+  private def rebuildFilterFromParquetHelper(oldFilter: Filter, fileSchema: InternalSchema, querySchema: InternalSchema): Filter = {
+if (fileSchema == null || querySchema == null) {
+  oldFilter
+} else {
+  oldFilter match {
+case eq: EqualTo =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(eq.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else eq.copy(attribute = newAttribute)
+case eqs: EqualNullSafe =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(eqs.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else eqs.copy(attribute = newAttribute)
+case gt: GreaterThan =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(gt.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else gt.copy(attribute = newAttribute)
+case gtr: GreaterThanOrEqual =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(gtr.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else gtr.copy(attribute = newAttribute)
+case lt: LessThan 
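
   The cases above all share one shape; here is a minimal, self-contained 
Scala sketch of that shape, with hypothetical types (SimpleFilter and Eq stand 
in for Spark's sources.Filter, and the name map stands in for comparing the 
file and query InternalSchemas): each pushed-down filter is re-bound from 
query-schema column names to the names the file was written with, and a filter 
on a column the file does not know collapses to "always true", i.e. no 
parquet-level pruning for that predicate.

    sealed trait SimpleFilter
    case class Eq(attribute: String, value: Any) extends SimpleFilter
    case object AlwaysTrueFilter extends SimpleFilter

    object FilterRebindSketch {
      // Query-schema name -> name the file was written with. The map carries
      // an entry for every column the file knows, identity when not renamed.
      val queryToFileName: Map[String, String] =
        Map("rider_name" -> "rider", "ts" -> "ts")

      def rebind(f: SimpleFilter): SimpleFilter = f match {
        case Eq(attr, v) =>
          queryToFileName.get(attr) match {
            case Some(fileName) => Eq(fileName, v)
            case None           => AlwaysTrueFilter // column added after this file was written
          }
        case other => other
      }

      def main(args: Array[String]): Unit = {
        println(rebind(Eq("rider_name", "x"))) // Eq(rider,x): rebound to the old name
        println(rebind(Eq("ts", 5)))           // Eq(ts,5): unchanged
        println(rebind(Eq("added_later", 1)))  // AlwaysTrueFilter: cannot prune
      }
    }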

Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex commented on code in PR #10956:
URL: https://github.com/apache/hudi/pull/10956#discussion_r1566256863


##
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/execution/datasources/Spark3ParquetSchemaEvolutionUtils.scala:
##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.client.utils.SparkInternalSchemaConverter
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.util
+import org.apache.hudi.common.util.InternalSchemaCache
+import org.apache.hudi.common.util.StringUtils.isNullOrEmpty
+import org.apache.hudi.common.util.collection.Pair
+import org.apache.hudi.internal.schema.InternalSchema
+import org.apache.hudi.internal.schema.action.InternalSchemaMerger
+import org.apache.hudi.internal.schema.utils.{InternalSchemaUtils, SerDeHelper}
+import org.apache.parquet.hadoop.metadata.FileMetaData
+import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, UnsafeProjection}
+import org.apache.spark.sql.execution.datasources.Spark3ParquetSchemaEvolutionUtils.pruneInternalSchema
+import org.apache.spark.sql.execution.datasources.parquet.{HoodieParquetFileFormatHelper, ParquetReadSupport}
+import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types.{AtomicType, DataType, StructField, StructType}
+
+import scala.collection.convert.ImplicitConversions.`collection AsScalaIterable`
+
+abstract class Spark3ParquetSchemaEvolutionUtils(sharedConf: Configuration,
+ filePath: Path,
+ requiredSchema: StructType,
+ partitionSchema: StructType) {
+  // Fetch internal schema
+  private lazy val internalSchemaStr: String = sharedConf.get(SparkInternalSchemaConverter.HOODIE_QUERY_SCHEMA)
+
+  private lazy val querySchemaOption: util.Option[InternalSchema] = pruneInternalSchema(internalSchemaStr, requiredSchema)
+
+  var shouldUseInternalSchema: Boolean = !isNullOrEmpty(internalSchemaStr) && querySchemaOption.isPresent
+
+  private lazy val tablePath: String = sharedConf.get(SparkInternalSchemaConverter.HOODIE_TABLE_PATH)
+  private lazy val fileSchema: InternalSchema = if (shouldUseInternalSchema) {
+val commitInstantTime = FSUtils.getCommitTime(filePath.getName).toLong;
+val validCommits = sharedConf.get(SparkInternalSchemaConverter.HOODIE_VALID_COMMITS_LIST)
+InternalSchemaCache.getInternalSchemaByVersionId(commitInstantTime, tablePath, sharedConf, if (validCommits == null) "" else validCommits)
+  } else {
+null
+  }
+
+  def rebuildFilterFromParquet(filter: Filter): Filter = {
+rebuildFilterFromParquetHelper(filter, fileSchema, querySchemaOption.orElse(null))
+  }
+
+  private def rebuildFilterFromParquetHelper(oldFilter: Filter, fileSchema: InternalSchema, querySchema: InternalSchema): Filter = {
+if (fileSchema == null || querySchema == null) {
+  oldFilter
+} else {
+  oldFilter match {
+case eq: EqualTo =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(eq.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else eq.copy(attribute = newAttribute)
+case eqs: EqualNullSafe =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(eqs.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else eqs.copy(attribute = newAttribute)
+case gt: GreaterThan =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(gt.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else gt.copy(attribute = newAttribute)
+case gtr: GreaterThanOrEqual =>
+  val newAttribute = InternalSchemaUtils.reBuildFilterName(gtr.attribute, fileSchema, querySchema)
+  if (newAttribute.isEmpty) AlwaysTrue else gtr.copy(attribute = newAttribute)
+case lt: LessThan 

Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


jonvex commented on code in PR #10956:
URL: https://github.com/apache/hudi/pull/10956#discussion_r1566256714


##
hudi-spark-datasource/hudi-spark3.0.x/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark30ParquetReader.scala:
##
@@ -174,7 +190,7 @@ class Spark30ParquetReader(enableVectorizedReader: Boolean,
   reader.initialize(split, hadoopAttemptContext)
 
   val fullSchema = requiredSchema.toAttributes ++ 
partitionSchema.toAttributes
-  val unsafeProjection = GenerateUnsafeProjection.generate(fullSchema, 
fullSchema)
+  val unsafeProjection = 
schemaEvolutionUtils.generateUnsafeProjection(fullSchema, timeZoneId)

Review Comment:
   No longer needed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2057514833

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   * 24d5e7f082788b257a42598df5f1d2378e32b041 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23264)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7618] Add ability to ignore checkpoints in delta streamer [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11018:
URL: https://github.com/apache/hudi/pull/11018#issuecomment-2057514710

   
   ## CI report:
   
   * c0923360a546fcfd71c0111b9ea29894fa1fe7f3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23251)
 
   * 755ddfdc5d0a02ac1cf1c35fbf5ccd21e1025a31 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23266)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057514363

   
   ## CI report:
   
   * be7795021e2cffe600a109448ed02e5860385b9f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23262)
 
   * 8943bb4eaf741096203bed688905977d4bf59160 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23265)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2057502357

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   * 24d5e7f082788b257a42598df5f1d2378e32b041 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7618] Add ability to ignore checkpoints in delta streamer [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11018:
URL: https://github.com/apache/hudi/pull/11018#issuecomment-2057502296

   
   ## CI report:
   
   * c0923360a546fcfd71c0111b9ea29894fa1fe7f3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23251)
 
   * 755ddfdc5d0a02ac1cf1c35fbf5ccd21e1025a31 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7566] Add schema evolution to spark file readers [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #10956:
URL: https://github.com/apache/hudi/pull/10956#issuecomment-2057501959

   
   ## CI report:
   
   * be7795021e2cffe600a109448ed02e5860385b9f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23262)
 
   * 8943bb4eaf741096203bed688905977d4bf59160 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7582] Fix functional index lookup [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11021:
URL: https://github.com/apache/hudi/pull/11021#issuecomment-2057490440

   
   ## CI report:
   
   * 8cdf539f2193660299a3894b59d16a7c2b1a59fb UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7618] Add ability to ignore checkpoints in delta streamer [hudi]

2024-04-15 Thread via GitHub


sampan-s-nayak commented on code in PR #11018:
URL: https://github.com/apache/hudi/pull/11018#discussion_r1566189979


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java:
##
@@ -424,6 +439,11 @@ public static class Config implements Serializable {
 @Parameter(names = {"--config-hot-update-strategy-class"}, description = 
"Configuration hot update in continuous mode")
 public String configHotUpdateStrategyClass = "";
 
+@Parameter(names = {"--ignore-checkpoint"}, description = "Set this config 
with a unique value, recommend using a timestamp value or UUID."
++ " Setting this config indicates that the subsequent sync should 
ignore the last committed checkpoint for the source. The config value is stored"
++ " in the commit history, so setting the config with same values 
would not have any affect.")
+public String ignoreCheckpoint = null;

Review Comment:
   addressed
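
   For reference, a usage sketch under stated assumptions (hudi-utilities on 
the classpath, e.g. in a Spark shell; only the ignoreCheckpoint field is taken 
from this PR, the surrounding setup is elided): the value just has to be one 
that was not committed before, so a fresh UUID or timestamp works.

    import org.apache.hudi.utilities.streamer.HoodieStreamer

    // Sketch only: a new, never-before-used value makes the next sync ignore
    // the last committed checkpoint; re-running with the same value is a
    // no-op because it is compared against the commit history.
    val cfg = new HoodieStreamer.Config()
    cfg.ignoreCheckpoint = java.util.UUID.randomUUID().toString
    // ... set the usual required options (target base path, table name, source) ...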



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

2024-04-15 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7c12decc86c [HUDI-6762] Removed usages of 
MetadataRecordsGenerationParams (#10962)
7c12decc86c is described below

commit 7c12decc86ccfbdc8bd06fe64d3e1c507cbbfbf6
Author: Vova Kolmakov 
AuthorDate: Tue Apr 16 00:05:57 2024 +0700

[HUDI-6762] Removed usages of MetadataRecordsGenerationParams (#10962)

Co-authored-by: Vova Kolmakov 
---
 .../metadata/HoodieBackedTableMetadataWriter.java  | 158 ++--
 .../hudi/metadata/HoodieTableMetadataUtil.java | 281 -
 .../metadata/MetadataRecordsGenerationParams.java  |  89 ---
 3 files changed, 234 insertions(+), 294 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index 891cc88b9da..dea317e60b7 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -262,7 +262,7 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
   // NOTE: It needs to be guarded by async index config because if that is enabled then initialization happens through the index scheduler.
   if (!dataWriteConfig.isMetadataAsyncIndex()) {
 Set completedPartitions = dataMetaClient.getTableConfig().getMetadataPartitions();
-LOG.info("Async metadata indexing disabled and following partitions already initialized: " + completedPartitions);
+LOG.info("Async metadata indexing disabled and following partitions already initialized: {}", completedPartitions);
 // TODO: fix the filter to check for exact partition name, e.g. completedPartitions could have func_index_datestr,
 //   but now the user is trying to initialize the func_index_dayhour partition.
 this.enabledPartitionTypes.stream()
@@ -345,12 +345,6 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
   LOG.warn("Metadata Table will need to be re-initialized as no instants were found");
   return true;
 }
-
-final String latestMetadataInstantTimestamp = latestMetadataInstant.get().getTimestamp();
-if (latestMetadataInstantTimestamp.startsWith(SOLO_COMMIT_TIMESTAMP)) { // the initialization timestamp is SOLO_COMMIT_TIMESTAMP + offset
-  return false;
-}
-
 return false;
   }
 
@@ -411,8 +405,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
 for (MetadataPartitionType partitionType : partitionsToInit) {
   // Find the commit timestamp to use for this partition. Each initialization should use its own unique commit time.
   String commitTimeForPartition = generateUniqueCommitInstantTime(initializationTime);
-
-  LOG.info("Initializing MDT partition " + partitionType.name() + " at instant " + commitTimeForPartition);
+  String partitionTypeName = partitionType.name();
+  LOG.info("Initializing MDT partition {} at instant {}", partitionTypeName, commitTimeForPartition);
 
   Pair> fileGroupCountAndRecordsPair;
   try {
@@ -438,37 +432,41 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
 fileGroupCountAndRecordsPair = initializeFunctionalIndexPartition(functionalIndexPartitionsToInit.iterator().next());
 break;
   default:
-throw new HoodieMetadataException("Unsupported MDT partition type: " + partitionType);
+throw new HoodieMetadataException(String.format("Unsupported MDT partition type: %s", partitionType));
 }
   } catch (Exception e) {
 String metricKey = partitionType.getPartitionPath() + "_" + HoodieMetadataMetrics.BOOTSTRAP_ERR_STR;
 metrics.ifPresent(m -> m.setMetric(metricKey, 1));
-LOG.error("Bootstrap on " + partitionType.getPartitionPath() + " partition failed for "
-+ metadataMetaClient.getBasePath(), e);
-throw new HoodieMetadataException(partitionType.getPartitionPath()
-+ " bootstrap failed for " + metadataMetaClient.getBasePath(), e);
+String errMsg = String.format("Bootstrap on %s partition failed for %s",
+partitionType.getPartitionPath(), metadataMetaClient.getBasePathV2());
+LOG.error(errMsg, e);
+throw new HoodieMetadataException(errMsg, e);
   }
 
-  LOG.info(String.format("Initializing %s index with %d mappings and %d file groups.", partitionType.name(), fileGroupCountAndRecordsPair.getKey(),
-  fileGroupCountAndRecordsPair.getValue().cou
