Re: [PR] [HUDI-7032] ShowProcedures show add limit syntax to keep the same [hudi]

2023-11-07 Thread via GitHub


xuzifu666 commented on code in PR #9988:
URL: https://github.com/apache/hudi/pull/9988#discussion_r1384544659


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowSavepointsProcedure.scala:
##
@@ -54,7 +56,11 @@ class ShowSavepointsProcedure extends BaseProcedure with 
ProcedureBuilder {
 val commits: util.List[HoodieInstant] = 
timeline.getReverseOrderedInstants.collect(Collectors.toList[HoodieInstant])
 
 if (commits.isEmpty) Seq.empty[Row] else {
-  commits.toArray.map(instant => 
instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq
+  if (limit.isDefined) {
+
commits.stream().limit(limit.get.asInstanceOf[Int]).toArray.map(instant => 
instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq

Review Comment:
   I tried to refactor the code to abstract the limit handling into the parent 
class, but it doesn't seem to fit: the parent would need to take parameters 
from the subclass and cannot determine where the limit applies in the 
ordering, so I added a unit test first. Besides, there are not many show 
procedure commands.
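   For reference, a minimal sketch of the same optional-limit handling using 
Scala collections instead of a Java stream; `savepointRows` and its parameters 
are hypothetical stand-ins for the names in the diff above:

   ```scala
   import org.apache.spark.sql.Row

   // Sketch only: `timestamps` stands in for the instant timestamps collected
   // from the timeline, `limit` for the optional procedure argument.
   def savepointRows(timestamps: Seq[String], limit: Option[Int]): Seq[Row] = {
     val limited = limit.fold(timestamps)(n => timestamps.take(n)) // no-op when limit is absent
     limited.map(Row(_))
   }
   ```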






Re: [PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10013:
URL: https://github.com/apache/hudi/pull/10013#issuecomment-1801234649

   
   ## CI report:
   
   * 7272943f1fe1d3c2683fb97bd13f34658e2e04df Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20732)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801234529

   
   ## CI report:
   
   * 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
 
   * 9cd41c8ef03048bb724990ecf93c9db3b5883734 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20728)
 
   * 4689bc88d7a4df6c42918dcb0fc1cc94bc7a05a6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20731)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1801234330

   
   ## CI report:
   
   * 71eb41cec2aa93366754e0edf14767febca0c40d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20580)
 
   * b351705a990c8ea6b454dade0a33af1090cdf85c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20730)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]

2023-11-07 Thread via GitHub


seekforshell commented on issue #10012:
URL: https://github.com/apache/hudi/issues/10012#issuecomment-1801229697

   > Did you check the table creation schema persisted in `hoodie.properties`, 
i.e., how the timestamp precision is represented in the Avro format?
   
   Yes, here it is:
   `#Properties saved on 2023-11-06T09:16:47.982Z
   #Mon Nov 06 17:16:47 CST 2023
   hoodie.table.precombine.field=precombine_field
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.table.partition.fields=__partition_field
   hoodie.table.type=MERGE_ON_READ
   hoodie.archivelog.folder=archived
   
hoodie.compaction.payload.class=org.apache.hudi.common.model.EventTimeAvroPayload
   hoodie.timeline.layout.version=1
   hoodie.table.version=5
   hoodie.table.recordkey.fields=source_from,id
   hoodie.datasource.write.partitionpath.urlencode=false
   hoodie.table.name=air08_airflow_bucket_mor_t2
   
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexAvroKeyGenerator
   hoodie.datasource.write.hive_style_partitioning=false
   
hoodie.table.create.schema={"type"\:"record","name"\:"record","fields"\:[{"name"\:"source_from","type"\:["null","int"],"default"\:null},{"name"\:"id","type"\:["null","long"],"default"\:null},{"name"\:"name","type"\:["null","string"],"default"\:null},{"name"\:"create_time","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"price","type"\:["null",{"type"\:"fixed","name"\:"fixed","namespace"\:"record.price","size"\:6,"logicalType"\:"decimal","precision"\:14,"scale"\:2}],"default"\:null},{"name"\:"extend","type"\:["null","string"],"default"\:null},{"name"\:"count","type"\:["null","long"],"default"\:null},{"name"\:"create_date","type"\:["null",{"type"\:"int","logicalType"\:"date"}],"default"\:null},{"name"\:"ext_dt","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"precombine_field","type"\:["null","string"],"default"\:null},{"name"\:"sync_deleted","type"\:["null","int"],"default"\:null},{"name"\:"
 
sync_time","type"\:["null",{"type"\:"long","logicalType"\:"timestamp-millis"}],"default"\:null},{"name"\:"__binlog_file","type"\:["null","string"],"default"\:null},{"name"\:"__pos","type"\:["null","int"],"default"\:null},{"name"\:"source_sys","type"\:["null","int"],"default"\:null},{"name"\:"__partition_field","type"\:["null","int"],"default"\:null}]}
   hoodie.table.checksum=3920591838
   `
   





Re: [PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10013:
URL: https://github.com/apache/hudi/pull/10013#issuecomment-1801226730

   
   ## CI report:
   
   * 7272943f1fe1d3c2683fb97bd13f34658e2e04df UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801226585

   
   ## CI report:
   
   * 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
 
   * 9cd41c8ef03048bb724990ecf93c9db3b5883734 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20728)
 
   * 4689bc88d7a4df6c42918dcb0fc1cc94bc7a05a6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1801226273

   
   ## CI report:
   
   * 71eb41cec2aa93366754e0edf14767febca0c40d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20580)
 
   * b351705a990c8ea6b454dade0a33af1090cdf85c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-7050) Flink hoodiehivecatalog supports hadoop parameters

2023-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7050:
-
Labels: pull-request-available  (was: )

> Flink hoodiehivecatalog supports hadoop parameters
> --
>
> Key: HUDI-7050
> URL: https://issues.apache.org/jira/browse/HUDI-7050
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink-sql
>Reporter: waywtdcc
>Priority: Major
>  Labels: pull-request-available
>
> Flink hoodiehivecatalog supports hadoop parameters





[PR] [HUDI-7050]Flink HoodieHiveCatalog supports hadoop parameters [hudi]

2023-11-07 Thread via GitHub


waywtdcc opened a new pull request, #10013:
URL: https://github.com/apache/hudi/pull/10013

   ### Change Logs
   
   Flink HoodieHiveCatalog supports hadoop parameters
   
   ### Impact
   
   Flink HoodieHiveCatalog supports hadoop parameters
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   Flink HoodieHiveCatalog supports hadoop parameters
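   A hypothetical usage sketch of what this change might enable; the `hadoop.` 
option prefix and the specific keys are assumptions based on the PR title, not 
a confirmed API:

   ```scala
   import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

   val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())
   // Hypothetical: forward Hadoop client settings through the catalog options.
   tEnv.executeSql(
     """CREATE CATALOG hudi_catalog WITH (
       |  'type' = 'hudi',
       |  'mode' = 'hms',
       |  'hive.conf.dir' = '/etc/hive/conf',
       |  'hadoop.dfs.client.use.datanode.hostname' = 'true'
       |)""".stripMargin)
   ```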
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Created] (HUDI-7050) Flink hoodiehivecatalog supports hadoop parameters

2023-11-07 Thread waywtdcc (Jira)
waywtdcc created HUDI-7050:
--

 Summary: Flink hoodiehivecatalog supports hadoop parameters
 Key: HUDI-7050
 URL: https://issues.apache.org/jira/browse/HUDI-7050
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink-sql
Reporter: waywtdcc


Flink hoodiehivecatalog supports hadoop parameters





Re: [I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on issue #10012:
URL: https://github.com/apache/hudi/issues/10012#issuecomment-1801196648

   Did you check the table creation schema persisted in `hoodie.properties`, 
i.e., how the timestamp precision is represented in the Avro format?





Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on code in PR #9998:
URL: https://github.com/apache/hudi/pull/9998#discussion_r1386080725


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -427,7 +427,7 @@ public class HoodieWriteConfig extends HoodieConfig {
 
   public static final ConfigProperty 
INSTANT_STATE_TIMELINE_SERVER_BASED = ConfigProperty
   .key("hoodie.instant_state.timeline_server_based.enabled")
-  .defaultValue(false)
+  .defaultValue(true)

Review Comment:
   This is only an improvement to Flink writers currently.






Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]

2023-11-07 Thread via GitHub


yihua commented on code in PR #10007:
URL: https://github.com/apache/hudi/pull/10007#discussion_r1386080312


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala:
##
@@ -74,11 +76,25 @@ class NewHoodieParquetFileFormat(tableState: 
Broadcast[HoodieTableState],
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
 if (!supportBatchCalled) {
   supportBatchCalled = true
-  supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema)
+  supportBatchResult = !isMOR && legacyFF.supportBatch(sparkSession, 
schema)
 }
 supportBatchResult
   }
 
+  private def wrapWithBatchConverter(reader: PartitionedFile => 
Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] = {

Review Comment:
   Right. @jonvex I think Spark internally handles the batch processing 
(`InternalRow` vs `ColumnarBatch`) based on the boolean that `supportBatch` 
returns, so we shouldn't need the batch converter here?
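   For context, a minimal sketch of that contract (the class name is 
hypothetical; the `supportBatch` override mirrors the diff above): when 
`supportBatch` returns true, Spark plans a columnar scan and expects 
`ColumnarBatch` values from the reader, otherwise plain `InternalRow`s, which 
is why MOR merging disables batching.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
   import org.apache.spark.sql.types.StructType

   class SketchFileFormat(isMOR: Boolean) extends ParquetFileFormat {
     // Batch (ColumnarBatch) reads are only safe when no MOR merging is involved.
     override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean =
       !isMOR && super.supportBatch(sparkSession, schema)
   }
   ```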






Re: [I] [SUPPORT] Solution for synchronizing the entire database table in flink [hudi]

2023-11-07 Thread via GitHub


ad1happy2go commented on issue #9965:
URL: https://github.com/apache/hudi/issues/9965#issuecomment-1801186970

   @bajiaolong 
   
   1. It's not limited to 20; I think what @danny0405 meant is that the number 
of tables should stay a handful, since you would need to manage that many 
streams.
   2. Not sure if there is a way to read only specific partitions from a Kafka 
topic (one possible approach is sketched below).
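   On point 2, Flink's `KafkaSource` builder does expose an explicit partition 
set; whether that fits the multi-table sync setup discussed here is untested, 
so treat this as a sketch:

   ```scala
   import org.apache.flink.api.common.serialization.SimpleStringSchema
   import org.apache.flink.connector.kafka.source.KafkaSource
   import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema
   import org.apache.kafka.common.TopicPartition
   import scala.collection.JavaConverters._

   // Hypothetical broker and topic; reads only partitions 0 and 3 of the topic.
   val source: KafkaSource[String] = KafkaSource.builder[String]()
     .setBootstrapServers("broker:9092")
     .setPartitions(Set(new TopicPartition("my_topic", 0), new TopicPartition("my_topic", 3)).asJava)
     .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(new SimpleStringSchema()))
     .build()
   ```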
   





[jira] [Closed] (HUDI-7030) Log reader data lost as that not consistent behavior in timeline's containsInstant

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7030.

Resolution: Fixed

Fixed via master branch: e731755b99057dc916378f1f7e95c73642ff96e8

> Log reader data lost as that not consistent behavior in timeline's 
> containsInstant 
> ---
>
> Key: HUDI-7030
> URL: https://issues.apache.org/jira/browse/HUDI-7030
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: ann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: image-2023-11-03-19-48-29-441.png, 
> image-2023-11-03-19-49-22-894.png, image-2023-11-03-19-50-11-849.png, 
> image-2023-11-03-19-58-39-495.png, image-2023-11-03-20-06-00-579.png, 
> image-2023-11-03-20-06-13-905.png, image-2023-11-03-20-07-30-201.png
>
>
> The log reader filtered out all log data blocks that come from inflight 
> instants. 
> !image-2023-11-03-19-49-22-894.png!
> *containsInstant* returns false when the input instant's timestamp does not 
> equal any instant timestamp in the inflight timeline. 
> !image-2023-11-03-20-07-30-201.png!
> But when the input to the timeline's *containsInstant* is the instant's 
> timestamp string, it can return true.
>  
> When the input is an instant carrying the default_millis_ext suffix, its 
> timestamp is less than some instant timestamp in the timeline. 
> !image-2023-11-03-19-50-11-849.png!
> As a result, the log reader skipped the completed delta commit instant and 
> caused data loss.
> !image-2023-11-03-19-58-39-495.png!
> I think the timeline's containsInstant should behave consistently, so 
> containsOrBeforeTimelineStarts should be replaced with containsInstant.
> !image-2023-11-03-19-48-29-441.png!
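A schematic sketch of the consistent behavior the issue asks for (a set of 
timestamp strings stands in for the timeline; the "999" suffix mirrors 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT as used in the fix below): 
membership should only strip the millis extension and re-check, instead of 
falling back to containsOrBeforeTimelineStarts, which also matches instants 
before the timeline start.

```scala
// Simplified model: an old second-granularity timestamp may have been padded
// with the "999" millis extension by timeline operations.
val DefaultMillisExt = "999" // assumption mirroring HoodieInstantTimeGenerator

def containsInstant(timeline: Set[String], ts: String): Boolean =
  timeline.contains(ts) ||
    (ts.endsWith(DefaultMillisExt) &&
      timeline.contains(ts.dropRight(DefaultMillisExt.length)))
```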





[jira] [Updated] (HUDI-7030) Log reader data lost as that not consistent behavior in timeline's containsInstant

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7030:
-
Fix Version/s: 1.0.0
   0.14.1

> Log reader data lost as that not consistent behavior in timeline's 
> containsInstant 
> ---
>
> Key: HUDI-7030
> URL: https://issues.apache.org/jira/browse/HUDI-7030
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: ann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: image-2023-11-03-19-48-29-441.png, 
> image-2023-11-03-19-49-22-894.png, image-2023-11-03-19-50-11-849.png, 
> image-2023-11-03-19-58-39-495.png, image-2023-11-03-20-06-00-579.png, 
> image-2023-11-03-20-06-13-905.png, image-2023-11-03-20-07-30-201.png
>
>
> The log reader filtered out all log data blocks that come from inflight 
> instants. 
> !image-2023-11-03-19-49-22-894.png!
> *containsInstant* returns false when the input instant's timestamp does not 
> equal any instant timestamp in the inflight timeline. 
> !image-2023-11-03-20-07-30-201.png!
> But when the input to the timeline's *containsInstant* is the instant's 
> timestamp string, it can return true.
>  
> When the input is an instant carrying the default_millis_ext suffix, its 
> timestamp is less than some instant timestamp in the timeline. 
> !image-2023-11-03-19-50-11-849.png!
> As a result, the log reader skipped the completed delta commit instant and 
> caused data loss.
> !image-2023-11-03-19-58-39-495.png!
> I think the timeline's containsInstant should behave consistently, so 
> containsOrBeforeTimelineStarts should be replaced with containsInstant.
> !image-2023-11-03-19-48-29-441.png!





(hudi) branch master updated: [HUDI-7030] Update containsInstant without containsOrBeforeTimelineStarts to fix data lost (#9982)

2023-11-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e731755b990 [HUDI-7030] Update containsInstant without 
containsOrBeforeTimelineStarts to fix data lost (#9982)
e731755b990 is described below

commit e731755b99057dc916378f1f7e95c73642ff96e8
Author: xoln ann 
AuthorDate: Wed Nov 8 14:39:32 2023 +0800

[HUDI-7030] Update containsInstant without containsOrBeforeTimelineStarts 
to fix data lost (#9982)
---
 .../hudi/client/functional/TestHoodieIndex.java | 21 +
 .../table/timeline/HoodieDefaultTimeline.java   |  2 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
index 4518b909813..37199c783bb 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java
@@ -553,6 +553,27 @@ public class TestHoodieIndex extends 
TestHoodieMetadataBase {
 assertFalse(timeline.empty());
 assertFalse(HoodieIndexUtils.checkIfValidCommit(timeline, 
instantTimestamp));
 assertFalse(HoodieIndexUtils.checkIfValidCommit(timeline, 
instantTimestampSec));
+
+// Check the completed delta commit instant which is end with 
DEFAULT_MILLIS_EXT timestamp
+// Timestamp not contain in inflight timeline, checkContainsInstant() 
should return false
+// Timestamp contain in inflight timeline, checkContainsInstant() should 
return true
+String checkInstantTimestampSec = instantTimestamp.substring(0, 
instantTimestamp.length() - 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
+String checkInstantTimestamp = checkInstantTimestampSec + 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT;
+Thread.sleep(2000); // sleep required so that new timestamp differs in the 
seconds rather than msec
+String newTimestamp = writeClient.createNewInstantTime();
+String newTimestampSec = newTimestamp.substring(0, newTimestamp.length() - 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
+final HoodieInstant instant5 = new HoodieInstant(true, 
HoodieTimeline.DELTA_COMMIT_ACTION, newTimestamp);
+timeline = new HoodieDefaultTimeline(Stream.of(instant5), 
metaClient.getActiveTimeline()::getInstantDetails);
+assertFalse(timeline.empty());
+assertFalse(timeline.containsInstant(checkInstantTimestamp));
+assertFalse(timeline.containsInstant(checkInstantTimestampSec));
+
+final HoodieInstant instant6 = new HoodieInstant(true, 
HoodieTimeline.DELTA_COMMIT_ACTION, newTimestampSec + 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT);
+timeline = new HoodieDefaultTimeline(Stream.of(instant6), 
metaClient.getActiveTimeline()::getInstantDetails);
+assertFalse(timeline.empty());
+assertFalse(timeline.containsInstant(newTimestamp));
+assertFalse(timeline.containsInstant(checkInstantTimestamp));
+assertTrue(timeline.containsInstant(instant6.getTimestamp()));
   }
 
   @Test
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
index ec7c9633576..ecf7c938b01 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
@@ -440,7 +440,7 @@ public class HoodieDefaultTimeline implements 
HoodieTimeline {
 // Check for older timestamp which have sec granularity and an extension 
of DEFAULT_MILLIS_EXT may have been added via Timeline operations
 if (ts.length() == 
HoodieInstantTimeGenerator.MILLIS_INSTANT_TIMESTAMP_FORMAT_LENGTH && 
ts.endsWith(HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT)) {
   final String actualOlderFormatTs = ts.substring(0, ts.length() - 
HoodieInstantTimeGenerator.DEFAULT_MILLIS_EXT.length());
-  return containsOrBeforeTimelineStarts(actualOlderFormatTs);
+  return containsInstant(actualOlderFormatTs);
 }
 
 return false;



Re: [PR] [HUDI-7030] update containsInstant without containsOrBeforeTimelineStarts to fix data lost [hudi]

2023-11-07 Thread via GitHub


danny0405 merged PR #9982:
URL: https://github.com/apache/hudi/pull/9982





Re: [PR] [HUDI-7030] update containsInstant without containsOrBeforeTimelineStarts to fix data lost [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on PR #9982:
URL: https://github.com/apache/hudi/pull/9982#issuecomment-1801183877

   Tests have passed: 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=20704&view=results





Re: [PR] [MINOR] Change some default configs for 1.0.0-beta [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9998:
URL: https://github.com/apache/hudi/pull/9998#issuecomment-1801183250

   
   ## CI report:
   
   * 420bf60614d20a1caf77bb6616e5fb8d7420b89e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20707)
 
   * 9cd41c8ef03048bb724990ecf93c9db3b5883734 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7032] ShowProcedures show add limit syntax to keep the same [hudi]

2023-11-07 Thread via GitHub


xuzifu666 commented on code in PR #9988:
URL: https://github.com/apache/hudi/pull/9988#discussion_r1386071295


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowSavepointsProcedure.scala:
##
@@ -54,7 +56,11 @@ class ShowSavepointsProcedure extends BaseProcedure with 
ProcedureBuilder {
 val commits: util.List[HoodieInstant] = 
timeline.getReverseOrderedInstants.collect(Collectors.toList[HoodieInstant])
 
 if (commits.isEmpty) Seq.empty[Row] else {
-  commits.toArray.map(instant => 
instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq
+  if (limit.isDefined) {
+
commits.stream().limit(limit.get.asInstanceOf[Int]).toArray.map(instant => 
instant.asInstanceOf[HoodieInstant].getTimestamp).map(p => Row(p)).toSeq

Review Comment:
   I tried to construct a 'limit and collect' method in the parent class, but 
faced two problems:
   1. the parameter can be a list or an RDD, so the signature cannot stay the 
same;
   2. some lists need special handling and some do not, which forces the 
implementation into the subclass.
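   A sketch of why problem 1 bites (names are hypothetical): a `Seq` and an 
`RDD` cannot share one signature without changing semantics, because limiting 
an RDD materializes rows at the driver.

   ```scala
   import org.apache.spark.rdd.RDD

   trait LimitSupport {
     protected def applyLimit[T](rows: Seq[T], limit: Option[Int]): Seq[T] =
       limit.fold(rows)(n => rows.take(n))

     // The RDD overload returns an Array and pulls data to the driver when the
     // limit applies, so the two variants cannot be unified cleanly.
     protected def applyLimit[T](rows: RDD[T], limit: Option[Int]): Array[T] =
       limit.map(n => rows.take(n)).getOrElse(rows.collect())
   }
   ```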







[I] [BUG] Spark will read invalid timestamp(3) data when record in log is older than the same in parquet. [hudi]

2023-11-07 Thread via GitHub


seekforshell opened a new issue, #10012:
URL: https://github.com/apache/hudi/issues/10012

   
   Describe the problem you faced
   
   Spark reads invalid timestamp(3) data when the record in the log is older 
than the same record in parquet. 
   
   To Reproduce
   
   1. create a mor table with timestamp(3) type. 
   eg.
CREATE EXTERNAL TABLE `xxx.bucket_mor_t2`( 
  `_hoodie_commit_time` string COMMENT '', 
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',  
  `_hoodie_partition_path` string COMMENT '',  
  `_hoodie_file_name` string COMMENT '',   
  `source_from` int COMMENT '',
  `id` bigint COMMENT '',  
  `name` string COMMENT '',
  `create_time` timestamp COMMENT '',  
  `price` decimal(14,2) COMMENT '',
  `extend` string COMMENT '',  
  `count` bigint COMMENT '',   
  `create_date` date COMMENT '',   
  `ext_dt` timestamp COMMENT '',   
  `precombine_field` string COMMENT '',
  `sync_deleted` int COMMENT '',   
  `sync_time` timestamp COMMENT '',
  `__binlog_file` string COMMENT '',   
  `__pos` int COMMENT '',  
  `source_sys` int COMMENT '') 
PARTITIONED BY (   
  `__partition_field` int COMMENT '')  
ROW FORMAT SERDE   
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='false',  
  'path'='hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2')  
STORED AS INPUTFORMAT  
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'  
OUTPUTFORMAT   
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION   
  'hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2' 
TBLPROPERTIES (
  'connector'='hudi',  
  'hoodie.datasource.write.recordkey.field'='source_from,id',  
  'last_commit_time_sync'='20231106172508127', 
  'path'='hdfs://NameNodeService1/xxx/xxx/bucket_mor_t2',  
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1',  
  'spark.sql.sources.schema.numParts'='1', 
  
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"source_from","type":"integer","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"create_time","type":"timestamp","nullable":true,"metadata":{}},{"name":"price","type":"decimal(14,2)","nullable":true,"metadata":{}},{"name":"extend","type":"string","nullable":true,"metadata":{}},{"name":"count","type":"long","nullable":true,"metadata":{}},{"name":"create_date","type":"date","nullable":true,"metadata":{}},{"name":"ext_dt","ty
 
pe":"timestamp","nullable":true,"metadata":{}},{"name":"precombine_field","type":"string","nullable":true,"metadata":{}},{"name":"sync_deleted","type":"integer","nullable":true,"metadata":{}},{"name":"sync_time","type":"timestamp","nullable":true,"metadata":{}},{"name":"__binlog_file","type":"string","nullable":true,"metadata":{}},{"name":"__pos","type":"integer","nullable":true,"metadata":{}},{"name":"source_sys","type":"integer","nullable":true,"metadata":{}},{"name":"__partition_field","type":"integer","nullable":true,"metadata":{}}]}',
  
  'spark.sql.sources.schema.partCol.0'='__partition_field',  
  'table.type'='MERGE_ON_READ',
  'transient_lastDdlTime'='1692251328')
   
   2. Insert new data into parquet with the Flink engine, e.g. insert a record 
(id=1) with precombine value = 013088002803892750.
   
   3. Mock a binlog entry (the same record as in step 2) with precombine value = 
1 (which is smaller than before) and commit, but do not run compaction.
   
   Finally, read the record (id=1) in snapshot mode with Spark SQL; invalid data 
appears:
   
   
![b6c3e286dd36ef29f47f6ec569983e82](https://github.com/apache/hudi/assets/8132965/06d3a4b5-ae06-4387-9b2a-0e6b12127e2a)
   
   
   Expected behavior
   
   when

Re: [PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10011:
URL: https://github.com/apache/hudi/pull/10011#issuecomment-1801175482

   
   ## CI report:
   
   * 1e96f587385fb7969f2f24946dcec50f9533dee8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20727)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10010:
URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801175442

   
   ## CI report:
   
   * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723)
 
   * aa9fd9357a2398f3d35e9e3bb71cd9bee4be8432 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20726)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]

2023-11-07 Thread via GitHub


codope commented on code in PR #10009:
URL: https://github.com/apache/hudi/pull/10009#discussion_r1386050749


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieIncrementalFileIndex.scala:
##
@@ -36,9 +36,10 @@ class HoodieIncrementalFileIndex(override val spark: 
SparkSession,
  override val schemaSpec: Option[StructType],
  override val options: Map[String, String],
  @transient override val fileStatusCache: 
FileStatusCache = NoopCache,
- override val includeLogFiles: Boolean)
+ override val includeLogFiles: Boolean,
+ override val shouldEmbedFileSlices: Boolean)

Review Comment:
   do we have some bootstrap tests covering the path w/ and w/o 
`shouldEmbedFileSlices`?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieIncrementalFileIndex.scala:
##
@@ -47,52 +48,52 @@ class HoodieIncrementalFileIndex(override val spark: 
SparkSession,
 val fileSlices = 
mergeOnReadIncrementalRelation.listFileSplits(partitionFilters, dataFilters)
 if (fileSlices.isEmpty) {
   Seq.empty
-}
-
-val prunedPartitionsAndFilteredFileSlices = fileSlices.map {
-  case (partitionValues, fileSlices) =>
-if (shouldEmbedFileSlices) {
-  val baseFileStatusesAndLogFileOnly: Seq[FileStatus] = 
fileSlices.map(slice => {
-if (slice.getBaseFile.isPresent) {
-  slice.getBaseFile.get().getFileStatus
-} else if (slice.getLogFiles.findAny().isPresent) {
-  slice.getLogFiles.findAny().get().getFileStatus
+} else {
+  val prunedPartitionsAndFilteredFileSlices = fileSlices.map {
+case (partitionValues, fileSlices) =>
+  if (shouldEmbedFileSlices) {

Review Comment:
   I see some opportunity to reuse code here and in `HoodieFileIndex.listFiles`.






Re: [I] [SUPPORT]insert_overwrite mode writing 2 times more duplicates [hudi]

2023-11-07 Thread via GitHub


ad1happy2go commented on issue #9992:
URL: https://github.com/apache/hudi/issues/9992#issuecomment-1801165966

   @rishabhreply Did this resolve your doubts? Let us know if you need any more 
help. Thanks.





Re: [PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10011:
URL: https://github.com/apache/hudi/pull/10011#issuecomment-1801164803

   
   ## CI report:
   
   * 1e96f587385fb7969f2f24946dcec50f9533dee8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10010:
URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801164759

   
   ## CI report:
   
   * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723)
 
   * aa9fd9357a2398f3d35e9e3bb71cd9bee4be8432 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]

2023-11-07 Thread via GitHub


codope commented on code in PR #10007:
URL: https://github.com/apache/hudi/pull/10007#discussion_r1385970457


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala:
##
@@ -74,11 +76,25 @@ class NewHoodieParquetFileFormat(tableState: 
Broadcast[HoodieTableState],
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
 if (!supportBatchCalled) {
   supportBatchCalled = true
-  supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema)
+  supportBatchResult = !isMOR && legacyFF.supportBatch(sparkSession, 
schema)
 }
 supportBatchResult
   }
 
+  private def wrapWithBatchConverter(reader: PartitionedFile => 
Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] = {

Review Comment:
   Why is this needed? I think the flatMap per row could incur significant 
cost for a large batch. Instead of wrapping every time, can it be guarded for 
certain cases, such as when schema on read is enabled?
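   A minimal sketch of the suggested guard (`schemaOnReadEnabled` is a 
hypothetical flag; `wrapWithBatchConverter` from the diff would be passed in 
as `wrap`):

   ```scala
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.execution.datasources.PartitionedFile

   object BatchConverterGuard {
     type Reader = PartitionedFile => Iterator[InternalRow]

     // Only pay the per-row wrapping cost when schema evolution can actually occur.
     def maybeWrap(base: Reader, wrap: Reader => Reader, schemaOnReadEnabled: Boolean): Reader =
       if (schemaOnReadEnabled) wrap(base) else base
   }
   ```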






Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


majian1998 commented on code in PR #10010:
URL: https://github.com/apache/hudi/pull/10010#discussion_r1386042124


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsFileSystemReporter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import com.codahale.metrics.MetricRegistry;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+public class MetricsFileSystemReporter extends MetricsReporter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(MetricsFileSystemReporter.class);
+  private MetricRegistry metricRegistry;
+  private SerializableConfiguration hadoopConf;
+  private String metricsPath;
+  private HoodieWriteConfig config;
+  private FileSystem fs;
+  private ScheduledExecutorService executor;
+  private static final String META_FOLDER_NAME = "/.hoodie";
+  private static final String METRICS_FOLDER_NAME = "/metrics";
+  private static final String METRICS_FILE_NAME = "_metrics.json";

Review Comment:
   Currently, the issue of storing multiple versions of results has been 
considered, and an overwrite parameter has been reserved to control this. The 
initial idea is to add a timestamp + cleanup strategy. However, this would be 
quite complex, so it has been temporarily placed in the TODO list, and the 
feasibility of the file system reporter needs to be confirmed first.
   Regarding the current way of overwriting files, by default, the data will be 
written under the ".hoodie" directory of the table and only one file will be 
kept, so there will be no conflicts with tables of the same name. Additionally, 
the file name prefix is designed to handle the overwriting of results in 
different functionalities or table scenarios. There is certainly room for 
optimization in this aspect.
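   A sketch of the timestamp + cleanup idea mentioned as a TODO (the file 
naming and retention count are hypothetical; the current reporter keeps a 
single overwritten `_metrics.json`):

   ```scala
   import java.nio.file.{Files, Path}
   import scala.collection.JavaConverters._

   object MetricsRetentionSketch {
     // Write each report to a timestamped file, then keep only the newest `keep`.
     def writeWithRetention(dir: Path, payload: String, keep: Int): Unit = {
       Files.createDirectories(dir)
       Files.write(dir.resolve(s"${System.currentTimeMillis()}_metrics.json"),
         payload.getBytes("UTF-8"))
       val reports = Files.list(dir).iterator().asScala.toSeq
         .sortBy(_.getFileName.toString) // equal-length millis timestamps sort correctly
       reports.dropRight(keep).foreach(p => Files.delete(p))
     }
   }
   ```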






(hudi) branch master updated (0cb77908357 -> b08874268fb)

2023-11-07 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 0cb77908357 [HUDI-7042] Fix new filegroup reader (#10003)
 add b08874268fb [MINOR] Fix tests that set precombine to nonexistent field 
(#10008)

No new revisions were added by this update.

Summary of changes:
 .../src/test/scala/org/apache/hudi/TestHoodieFileIndex.scala   | 3 ++-
 .../src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala  | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)



Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]

2023-11-07 Thread via GitHub


yihua merged PR #10008:
URL: https://github.com/apache/hudi/pull/10008





[jira] [Updated] (HUDI-7046) Fix partial merging logic based on projected schema

2023-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7046:
-
Labels: pull-request-available  (was: )

> Fix partial merging logic based on projected schema
> ---
>
> Key: HUDI-7046
> URL: https://issues.apache.org/jira/browse/HUDI-7046
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> When querying a table where multiple rounds of partial updates have generated 
> multiple log files, the partial merging logic may fail or give wrong results 
> due to issues in the schema handling and merging logic.





[PR] [HUDI-7046] Fix partial merging logic based on projected reader schema [hudi]

2023-11-07 Thread via GitHub


yihua opened a new pull request, #10011:
URL: https://github.com/apache/hudi/pull/10011

   ### Change Logs
   
   This PR fixes the logic of merging partial updates with projected reader 
schema, i.e., the reader schema contains a subset of fields from the table 
schema based on the query.
   - When processing log records in 
`HoodieBaseFileGroupRecordBuffer#doProcessNextDataRecord`, the schema of the 
combined record is also updated in the metadata since the schema can change due 
to partial merging;
   - A bug of getting the field values from the older record in 
`SparkRecordMergingUtils#mergePartialRecords` is fixed.
   - The partial update tests in `TestPartialUpdateForMergeInto` are enhanced 
to cover partial merging logic.
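   Not Hudi's actual `SparkRecordMergingUtils`, but a schematic model of the 
merging described above: the newer record is partial (projected schema), so 
the merge keeps the older record's values for missing fields, and the combined 
schema must be tracked explicitly.

   ```scala
   case class PartialRecord(fields: Seq[String], values: Map[String, Any])

   def mergePartial(older: PartialRecord, newerPartial: PartialRecord): PartialRecord = {
     // Fields supplied by the newer partial record win; everything else carries
     // over from the older record. The merged schema is the union of both, which
     // is why the combined record's schema metadata must be updated after merging.
     PartialRecord(
       (older.fields ++ newerPartial.fields).distinct,
       older.values ++ newerPartial.values)
   }
   ```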
   
   ### Impact
   
   Makes sure the partial merging logic is correct.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader

2023-11-07 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6790:
--
Status: Patch Available  (was: In Progress)

> Support incremental read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-6790) Support incremental read in engine-agnostic FileGroupReader

2023-11-07 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6790.
-
Resolution: Done

> Support incremental read in engine-agnostic FileGroupReader
> ---
>
> Key: HUDI-6790
> URL: https://issues.apache.org/jira/browse/HUDI-6790
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Lin Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>






[jira] [Closed] (HUDI-7042) Fix filegroup reader

2023-11-07 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7042.
-
Resolution: Fixed

> Fix filegroup reader
> 
>
> Key: HUDI-7042
> URL: https://issues.apache.org/jira/browse/HUDI-7042
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Fix following issues for the new filegroup reader:
> - Handle nested schema
> - Append partition values correctly.





Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]

2023-11-07 Thread via GitHub


codope commented on code in PR #10007:
URL: https://github.com/apache/hudi/pull/10007#discussion_r1385970457


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala:
##
@@ -74,11 +76,25 @@ class NewHoodieParquetFileFormat(tableState: 
Broadcast[HoodieTableState],
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
 if (!supportBatchCalled) {
   supportBatchCalled = true
-  supportBatchResult = !isMOR && super.supportBatch(sparkSession, schema)
+  supportBatchResult = !isMOR && legacyFF.supportBatch(sparkSession, 
schema)
 }
 supportBatchResult
   }
 
+  private def wrapWithBatchConverter(reader: PartitionedFile => 
Iterator[InternalRow]): PartitionedFile => Iterator[InternalRow] = {

Review Comment:
   Why is this needed? I think the flatMap per row could incur significant 
cost for a large batch. Instead of wrapping, can it be guarded for certain 
cases, such as when schema on read is enabled?






Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10010:
URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801067235

   
   ## CI report:
   
   * 8f048d83427375dcc856ef78872a3d8247c9390f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20723)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]

2023-11-07 Thread via GitHub


codope merged PR #10003:
URL: https://github.com/apache/hudi/pull/10003





(hudi) branch master updated: [HUDI-7042] Fix new filegroup reader (#10003)

2023-11-07 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0cb77908357 [HUDI-7042] Fix new filegroup reader (#10003)
0cb77908357 is described below

commit 0cb7790835775c39b5cf71683b95f7618c6c95cc
Author: Sagar Sumit 
AuthorDate: Wed Nov 8 09:58:28 2023 +0530

[HUDI-7042] Fix new filegroup reader (#10003)
---
 .../hudi/common/model/HoodieSparkRecord.java   |  2 +
 .../read/HoodieBaseFileGroupRecordBuffer.java  |  2 +-
 .../common/table/read/HoodieFileGroupReader.java   |  2 +-
 .../table/read/HoodieFileGroupRecordBuffer.java|  2 +-
 ...odieFileGroupReaderBasedParquetFileFormat.scala | 69 +++---
 .../hudi/functional/TestMORDataSourceStorage.scala | 16 +++--
 .../functional/TestPartialUpdateAvroPayload.scala  | 23 +---
 style/scalastyle.xml   |  2 +-
 8 files changed, 93 insertions(+), 25 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java
 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java
index 3d59ad27257..5cb8800411c 100644
--- 
a/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java
+++ 
b/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/model/HoodieSparkRecord.java
@@ -40,6 +40,7 @@ import org.apache.spark.sql.catalyst.CatalystTypeConverters;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
 import org.apache.spark.sql.catalyst.expressions.JoinedRow;
+import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow;
 import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
 import org.apache.spark.sql.types.DataType;
@@ -447,6 +448,7 @@ public class HoodieSparkRecord extends 
HoodieRecord {
 || schema != null && (
 data instanceof HoodieInternalRow
 || data instanceof GenericInternalRow
+|| data instanceof SpecificInternalRow
 || 
SparkAdapterSupport$.MODULE$.sparkAdapter().isColumnarBatchRow(data));
 
 ValidationUtils.checkState(isValid);
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
index 4a1bd08e4ef..90ebf71dfb1 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java
@@ -80,7 +80,7 @@ public abstract class HoodieBaseFileGroupRecordBuffer 
implements HoodieFileGr
   }
 
   @Override
-  public void setBaseFileIteraotr(ClosableIterator baseFileIterator) {
+  public void setBaseFileIterator(ClosableIterator baseFileIterator) {
 this.baseFileIterator = baseFileIterator;
   }
 
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java
index b655238412d..2850a77d709 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java
@@ -154,7 +154,7 @@ public final class HoodieFileGroupReader implements 
Closeable {
 baseFilePath.get().getHadoopPath(), start, length, 
readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf)
 : new EmptyIterator<>();
 scanLogFiles();
-recordBuffer.setBaseFileIteraotr(baseFileIterator);
+recordBuffer.setBaseFileIterator(baseFileIterator);
   }
 
   /**
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java
index 680bbf9d705..0bf27cfc71e 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupRecordBuffer.java
@@ -100,7 +100,7 @@ public interface HoodieFileGroupRecordBuffer {
*
* @param baseFileIterator
*/
-  void setBaseFileIteraotr(ClosableIterator baseFileIterator);
+  void setBaseFileIterator(ClosableIterator baseFileIterator);
 
   /**
* Check if next merged record exists.
diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/executi

Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]

2023-11-07 Thread via GitHub


codope commented on code in PR #10003:
URL: https://github.com/apache/hudi/pull/10003#discussion_r1385964255


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -201,19 +208,65 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   length,
   shouldUseRecordPosition)
 reader.initRecordIterators()
-
reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala
+// Append partition values to rows and project to output schema
+appendPartitionAndProject(
+  
reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala,
+  requiredSchemaWithMandatory,
+  partitionSchema,
+  outputSchema,
+  partitionValues)
+  }
+
+  private def appendPartitionAndProject(iter: Iterator[InternalRow],
+inputSchema: StructType,
+partitionSchema: StructType,
+to: StructType,
+partitionValues: InternalRow): 
Iterator[InternalRow] = {
+if (partitionSchema.isEmpty) {
+  projectSchema(iter, inputSchema, to)

Review Comment:
   No, not really; `HoodieCatalystExpressionUtils.generateUnsafeProjection` 
already checks that. 
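   For readers following the thread: a minimal sketch, not the PR's code, of 
what such a short-circuit amounts to when written by hand. The import path for 
`HoodieCatalystExpressionUtils` and the `projectSchema` name are assumptions 
based on this discussion.
   
   ```scala
   import org.apache.spark.sql.HoodieCatalystExpressionUtils
   import org.apache.spark.sql.catalyst.InternalRow
   import org.apache.spark.sql.types.StructType
   
   // Sketch only: skip building a per-row projection when the input schema
   // already matches the output schema; otherwise delegate to
   // generateUnsafeProjection, which performs its own schema checks.
   def projectSchema(iter: Iterator[InternalRow],
                     from: StructType,
                     to: StructType): Iterator[InternalRow] = {
     if (from == to) {
       iter // already in the required output schema, nothing to project
     } else {
       val projection = HoodieCatalystExpressionUtils.generateUnsafeProjection(from, to)
       iter.map(projection)
     }
   }
   ```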



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10010:
URL: https://github.com/apache/hudi/pull/10010#issuecomment-1801060357

   
   ## CI report:
   
   * 8f048d83427375dcc856ef78872a3d8247c9390f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1801060265

   
   ## CI report:
   
   * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718)
 
   * bb60d3f2fe5737fc43a700bcc6c37806fe48868a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20722)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix npe for get internal schema [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9984:
URL: https://github.com/apache/hudi/pull/9984#issuecomment-1801060193

   
   ## CI report:
   
   * 23eb3d5bd578bffbd1165f7e178f391ce0056cb9 UNKNOWN
   * 2fb3eb51c728c5d3a9bdd725e77006a5141cc36f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20703)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Can not extract Partition Path with conf populateMetaFields set false and dropPartitionColumns set true [hudi]

2023-11-07 Thread via GitHub


zyl891229 commented on issue #9991:
URL: https://github.com/apache/hudi/issues/9991#issuecomment-1801056154

   > @zyl891229 Yes you are right, there is a issue with bulk_insert operation 
type along with combination of these two things. Although upsert/insert is 
running fine, you may use that. I confirmed both cases are failing , when using 
one partition col or two partition cols
   > 
   > JIRA to track - https://issues.apache.org/jira/browse/HUDI-7040
   > 
   > Reproducible code -
   > 
   > ```
   > spark = get_spark_session(spark_version="3.2", hudi_version="0.14.0")
   > 
   > insert_df = get_insert_df(spark, 10)
   > 
   > hudi_configs = {
   > "hoodie.table.name": TABLE_NAME,
   > "hoodie.datasource.write.recordkey.field": "UUID",
   > "hoodie.datasource.write.precombine.field": "Name",
   > "hoodie.datasource.write.partitionpath.field": "Company",
   > "hoodie.datasource.write.operation": "bulk_insert",
   > "hoodie.datasource.write.hive_style_partitioning": "true",
   > "hoodie.populate.meta.fields": "false",
   > "hoodie.datasource.write.drop.partition.columns": "true"
   > }
   > 
   > 
insert_df.write.format("hudi").mode("append").options(**hudi_configs).save(PATH)
   > ```
   
   
   
   
   Thank you for your reply. Is there any idea or method you can suggest?
   I will fix this problem in my fork first. We need to drop the useless columns 
to minimize storage space and save cost.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10008:
URL: https://github.com/apache/hudi/pull/10008#issuecomment-1801053296

   
   ## CI report:
   
   * b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20715)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1801053081

   
   ## CI report:
   
   * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705)
 
   * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718)
 
   * bb60d3f2fe5737fc43a700bcc6c37806fe48868a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix npe for get internal schema [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9984:
URL: https://github.com/apache/hudi/pull/9984#issuecomment-1801052829

   
   ## CI report:
   
   * 23eb3d5bd578bffbd1165f7e178f391ce0056cb9 UNKNOWN
   * 2fb3eb51c728c5d3a9bdd725e77006a5141cc36f Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20703)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix npe for get internal schema [hudi]

2023-11-07 Thread via GitHub


watermelon12138 commented on PR #9984:
URL: https://github.com/apache/hudi/pull/9984#issuecomment-1800970902

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6508] Fix compile errors with JDK11 [hudi]

2023-11-07 Thread via GitHub


bvaradar commented on PR #9300:
URL: https://github.com/apache/hudi/pull/9300#issuecomment-1800965782

   @Zouxxyy : Can you fix the merge conflicts ? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


stream2000 commented on code in PR #10010:
URL: https://github.com/apache/hudi/pull/10010#discussion_r1385925628


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsFileSystemReporter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import com.codahale.metrics.MetricRegistry;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+public class MetricsFileSystemReporter extends MetricsReporter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(MetricsFileSystemReporter.class);
+  private MetricRegistry metricRegistry;
+  private SerializableConfiguration hadoopConf;
+  private String metricsPath;
+  private HoodieWriteConfig config;
+  private FileSystem fs;
+  private ScheduledExecutorService executor;
+  private static final String META_FOLDER_NAME = "/.hoodie";
+  private static final String METRICS_FOLDER_NAME = "/metrics";
+  private static final String METRICS_FILE_NAME = "_metrics.json";

Review Comment:
   Who is going to delete the metrics file? If we report the metrics multiple 
times for the same table, will they overwrite each other? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


stream2000 commented on code in PR #10010:
URL: https://github.com/apache/hudi/pull/10010#discussion_r1385925628


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/MetricsFileSystemReporter.java:
##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.JsonUtils;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import com.codahale.metrics.MetricRegistry;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Map;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+public class MetricsFileSystemReporter extends MetricsReporter {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(MetricsFileSystemReporter.class);
+  private MetricRegistry metricRegistry;
+  private SerializableConfiguration hadoopConf;
+  private String metricsPath;
+  private HoodieWriteConfig config;
+  private FileSystem fs;
+  private ScheduledExecutorService executor;
+  private static final String META_FOLDER_NAME = "/.hoodie";
+  private static final String METRICS_FOLDER_NAME = "/metrics";
+  private static final String METRICS_FILE_NAME = "_metrics.json";

Review Comment:
   Who is going to delete the metrics file? If we report the metrics multiple 
times for the same table, will they overwrite each other? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix npe for get internal schema [hudi]

2023-11-07 Thread via GitHub


watermelon12138 commented on code in PR #9984:
URL: https://github.com/apache/hudi/pull/9984#discussion_r1385921027


##
hudi-common/src/main/java/org/apache/hudi/common/util/InternalSchemaCache.java:
##
@@ -217,7 +217,11 @@ public static InternalSchema 
getInternalSchemaByVersionId(long versionId, String
 }
 InternalSchema fileSchema = InternalSchemaUtils.searchSchema(versionId, 
SerDeHelper.parseSchemas(latestHistorySchema));
 // step3:
-return fileSchema.isEmptySchema() ? 
AvroInternalSchemaConverter.convert(HoodieAvroUtils.addMetadataFields(new 
Schema.Parser().parse(avroSchema))) : fileSchema;
+return fileSchema.isEmptySchema()
+? StringUtils.isNullOrEmpty(avroSchema)
+  ? InternalSchema.getEmptyInternalSchema()

Review Comment:
   @danny0405 Yes, some users hit this problem in the upgrade scenario (0.12.3 
-> 0.14).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7049) Implement File System-based Metrics Reporter

2023-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7049:
-
Labels: pull-request-available  (was: )

> Implement File System-based Metrics Reporter
> 
>
> Key: HUDI-7049
> URL: https://issues.apache.org/jira/browse/HUDI-7049
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ma Jian
>Priority: Major
>  Labels: pull-request-available
>
> In addition to real-time monitoring metrics, Hudi also has some result 
> metrics, such as IO for clustering reads and writes. These metrics are 
> meaningful for continuously observing the table service status.
> However, the existing metrics reporter either outputs to the console or 
> memory without persistence, or it outputs to another metrics server, 
> requiring complex environment setup. We hope to provide a simple persistent 
> reporter where users can specify that the metrics be stored in the file 
> system in JSON format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7049] Implement File System-based Metrics Reporter [hudi]

2023-11-07 Thread via GitHub


majian1998 opened a new pull request, #10010:
URL: https://github.com/apache/hudi/pull/10010

   In addition to real-time monitoring metrics, Hudi also has some result 
metrics, such as IO for clustering reads and writes. These metrics are 
meaningful for continuously observing the table service status.
   However, the existing metrics reporter either outputs to the console or 
memory without persistence, or it outputs to another metrics server, requiring 
complex environment setup. We hope to provide a simple persistent reporter 
where users can specify that the metrics be stored in the file system in JSON 
format.
   Ideally, we planned to flush the latest metrics to the file system by 
calling shutdown from a JVM shutdown hook when a job finishes. However, at that 
point the Hudi file system has already closed its connection pool, making it 
impossible to write the file. Therefore, we flush the file by calling the 
shutdown function explicitly when a write finishes. Currently, 
HoodieSparkSqlWriter.cleanup() calls shutdown explicitly, which means metrics 
are reported at the end of the write process; doing the same in the table 
services achieves the same effect.
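   
   Schematically, the two strategies differ as follows; a hedged sketch where 
`Reporter` is a hypothetical stand-in, not Hudi's actual reporter class:
   
   ```scala
   // Hypothetical sketch of the two flush strategies described above.
   trait Reporter { def report(): Unit; def stop(): Unit }
   
   // Rejected: a JVM shutdown hook runs after Hudi's FileSystem connection
   // pool is closed, so the final metrics write fails there.
   def flushViaShutdownHook(reporter: Reporter): Unit = {
     sys.addShutdownHook {
       reporter.report() // too late: file system already closed
     }
   }
   
   // What this PR does instead: flush explicitly while the file system is
   // still open, at the point where HoodieSparkSqlWriter.cleanup() runs.
   def flushExplicitly(reporter: Reporter): Unit = {
     reporter.report()
     reporter.stop()
   }
   ```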
   
   ### Change Logs
   
   Provides a file system-based metrics reporter
   
   ### Impact
   
   Some parameters related to the reporter:
   For example, FILESYSTEM has been added as a value of 
hoodie.metrics.reporter.type.
   The FILESYSTEM-specific options control the output path, file naming, and 
whether scheduled writing of the metrics file is enabled.
   
   
   
   ### Risk level (write none, low medium or high below)
   
   LOW
   
   ### Documentation Update
   
   Metrics report type supports FILESYSTEM
   Updated parameters:
   hoodie.metrics.reporter.type - FILESYSTEM has been added as a supported 
value.
   
   New parameters:
   hoodie.metrics.filesystem.reporter.path - The path for persisting Hudi 
storage metrics files.
   hoodie.metrics.filesystem.metric.prefix - The prefix for Hudi storage 
metrics persistence file names.
   hoodie.metrics.filesystem.overwrite.file - Whether to overwrite the existing 
metrics file for the same table.
   hoodie.metrics.filesystem.schedule.enable - Whether to enable scheduled 
output of metrics to the file system. Default is off, in which case only the 
final result is written to the file system.
   hoodie.metrics.filesystem.report.period.seconds - File system reporting 
period in seconds. Defaults to 60.
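   
   As an illustration, enabling the reporter from Spark might look like this. 
A sketch only: the `hoodie.metrics.filesystem.*` keys come from this PR, the 
other options are standard Hudi write options, and the paths, table name, and 
values are placeholders; assumes an existing SparkSession `spark` (e.g. 
spark-shell) with Hudi on the classpath.
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
   
   df.write.format("hudi").
     option("hoodie.table.name", "metrics_demo").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "name").
     option("hoodie.metrics.on", "true").
     option("hoodie.metrics.reporter.type", "FILESYSTEM").
     option("hoodie.metrics.filesystem.reporter.path", "/tmp/hudi_metrics").
     option("hoodie.metrics.filesystem.metric.prefix", "metrics_demo").
     mode(SaveMode.Append).
     save("/tmp/hudi/metrics_demo")
   ```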
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9966:
URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800950656

   
   ## CI report:
   
   * fa23cc909cdb9a3381c6646b3446ad44bd7b66d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20607)
 
   * 0a938b13fb76cbba8efce7bfc8edd5927094db67 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20721)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #9966:
URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800945396

   
   ## CI report:
   
   * fa23cc909cdb9a3381c6646b3446ad44bd7b66d0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20607)
 
   * 0a938b13fb76cbba8efce7bfc8edd5927094db67 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]

2023-11-07 Thread via GitHub


ksmou commented on PR #:
URL: https://github.com/apache/hudi/pull/#issuecomment-1800945369

   > @ksmou Can you also update the website about this new param?
   
   Got it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7033] Fix read error for schema evolution + partition value extraction (#9994)

2023-11-07 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new a4fa3451916 [HUDI-7033] Fix read error for schema evolution + 
partition value extraction (#9994)
a4fa3451916 is described below

commit a4fa3451916de11dc082792076b62013586dadaf
Author: voonhous 
AuthorDate: Wed Nov 8 10:49:48 2023 +0800

[HUDI-7033] Fix read error for schema evolution + partition value 
extraction (#9994)
---
 .../org/apache/hudi/HoodieDataSourceHelper.scala   | 61 +-
 .../apache/hudi/TestHoodieDataSourceHelper.scala   | 54 +++
 .../org/apache/spark/sql/hudi/TestSpark3DDL.scala  | 41 +++
 3 files changed, 154 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala
index eb8ddfdf870..4add21b5b8d 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala
@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.PredicateHelper
 import org.apache.spark.sql.execution.datasources.PartitionedFile
 import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
-import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.sources.{And, Filter, Or}
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.vectorized.ColumnarBatch
 
@@ -58,7 +58,7 @@ object HoodieDataSourceHelper extends PredicateHelper with 
SparkAdapterSupport {
   dataSchema = dataSchema,
   partitionSchema = partitionSchema,
   requiredSchema = requiredSchema,
-  filters = filters,
+  filters = if (appendPartitionValues) getNonPartitionFilters(filters, 
dataSchema, partitionSchema) else filters,
   options = options,
   hadoopConf = hadoopConf
 )
@@ -98,4 +98,61 @@ object HoodieDataSourceHelper extends PredicateHelper with 
SparkAdapterSupport {
   deserializer.deserialize(avroRecord).get.asInstanceOf[InternalRow]
 }
   }
+
+  def getNonPartitionFilters(filters: Seq[Filter], dataSchema: StructType, 
partitionSchema: StructType): Seq[Filter] = {
+filters.flatMap(f => {
+  if (f.references.intersect(partitionSchema.fields.map(_.name)).nonEmpty) 
{
+extractPredicatesWithinOutputSet(f, dataSchema.fieldNames.toSet)
+  } else {
+Some(f)
+  }
+})
+  }
+
+  /**
+   * Heavily adapted from {@see 
org.apache.spark.sql.catalyst.expressions.PredicateHelper#extractPredicatesWithinOutputSet}
+   * Method is adapted to work with Filters instead of Expressions
+   *
+   * @return
+   */
+  def extractPredicatesWithinOutputSet(condition: Filter,
+   outputSet: Set[String]): Option[Filter] 
= condition match {
+case And(left, right) =>
+  val leftResultOptional = extractPredicatesWithinOutputSet(left, 
outputSet)
+  val rightResultOptional = extractPredicatesWithinOutputSet(right, 
outputSet)
+  (leftResultOptional, rightResultOptional) match {
+case (Some(leftResult), Some(rightResult)) => Some(And(leftResult, 
rightResult))
+case (Some(leftResult), None) => Some(leftResult)
+case (None, Some(rightResult)) => Some(rightResult)
+case _ => None
+  }
+
+// The Or predicate is convertible when both of its children can be pushed 
down.
+// That is to say, if one/both of the children can be partially pushed 
down, the Or
+// predicate can be partially pushed down as well.
+//
+// Here is an example used to explain the reason.
+// Let's say we have
+// condition: (a1 AND a2) OR (b1 AND b2),
+// outputSet: AttributeSet(a1, b1)
+// a1 and b1 is convertible, while a2 and b2 is not.
+// The predicate can be converted as
+// (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
+// As per the logical in And predicate, we can push down (a1 OR b1).
+case Or(left, right) =>
+  for {
+lhs <- extractPredicatesWithinOutputSet(left, outputSet)
+rhs <- extractPredicatesWithinOutputSet(right, outputSet)
+  } yield Or(lhs, rhs)
+
+// Here we assume all the `Not` operators is already below all the `And` 
and `Or` operators
+// after the optimization rule `BooleanSimplification`, so that we don't 
need to handle the
+// `Not` operators here.
+case other =>
+  if (other.references.toSet.subsetOf(outputSet)) {
+Some(other)
+  } else {
+None
+  }
+  }
 }
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieDataSour
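
The partial-pushdown rule documented in the comment above can be exercised 
directly against `HoodieDataSourceHelper` with Spark's public `Filter` API. A 
small sketch, not part of the commit's tests:

```scala
import org.apache.hudi.HoodieDataSourceHelper
import org.apache.spark.sql.sources.{And, EqualTo, Filter, Or}

object PartialPushdownExample extends App {
  // condition: (a1 AND a2) OR (b1 AND b2), where only a1 and b1 are data columns.
  val condition: Filter = Or(
    And(EqualTo("a1", 1), EqualTo("a2", 2)),
    And(EqualTo("b1", 3), EqualTo("b2", 4)))

  val dataColumns = Set("a1", "b1")

  // Prints Some(Or(EqualTo(a1,1), EqualTo(b1,3))), i.e. (a1 OR b1): the
  // fragment that can still be pushed down to the data files.
  println(HoodieDataSourceHelper.extractPredicatesWithinOutputSet(condition, dataColumns))
}
```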

Re: [PR] [HUDI-7033] Fix read error for schema evolution + partition value ext… [hudi]

2023-11-07 Thread via GitHub


bvaradar merged PR #9994:
URL: https://github.com/apache/hudi/pull/9994


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7049) Implement File System-based Metrics Reporter

2023-11-07 Thread Ma Jian (Jira)
Ma Jian created HUDI-7049:
-

 Summary: Implement File System-based Metrics Reporter
 Key: HUDI-7049
 URL: https://issues.apache.org/jira/browse/HUDI-7049
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Ma Jian


In addition to real-time monitoring metrics, Hudi also has some result metrics, 
such as IO for clustering reads and writes. These metrics are meaningful for 
continuously observing the table service status.
However, the existing metrics reporter either outputs to the console or memory 
without persistence, or it outputs to another metrics server, requiring complex 
environment setup. We hope to provide a simple persistent reporter where users 
can specify that the metrics be stored in the file system in JSON format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7036) Reduce the driver memory pressure during buildProfile

2023-11-07 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu closed HUDI-7036.
-
Resolution: Fixed

> Reduce the driver memory pressure during buildProfile
> -
>
> Key: HUDI-7036
> URL: https://issues.apache.org/jira/browse/HUDI-7036
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
>  Labels: pull-request-available
>
> The record distribution should be based on (partition_path, instant_time, 
> file_id), instead of (partition_path, instant_time, file_id, position).
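
Schematically, the fix amounts to dropping the position component from the 
distribution key; a sketch with hypothetical names, not Hudi's actual types:

```scala
// Hypothetical sketch of the keying change described in this ticket.
case class RecordLocation(partitionPath: String, instantTime: String,
                          fileId: String, position: Long)

// Before: one group per distinct position, inflating driver-side state.
def keyBefore(l: RecordLocation) = (l.partitionPath, l.instantTime, l.fileId, l.position)

// After: one group per file group only.
def keyAfter(l: RecordLocation) = (l.partitionPath, l.instantTime, l.fileId)
```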



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10009:
URL: https://github.com/apache/hudi/pull/10009#issuecomment-1800913449

   
   ## CI report:
   
   * 8242310dbf950c9ece07c4b4fe5593e70e0bedf4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20719)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1800913385

   
   ## CI report:
   
   * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705)
 
   * 4dcb8f7bea46847202c2444e1a99901484239f4f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20718)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6993] Support Flink 1.18 [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on code in PR #9949:
URL: https://github.com/apache/hudi/pull/9949#discussion_r1385884858


##
packaging/bundle-validation/ci_run.sh:
##
@@ -162,6 +162,8 @@ else
 HUDI_FLINK_BUNDLE_NAME=hudi-flink1.16-bundle
   elif [[ ${FLINK_PROFILE} == 'flink1.17' ]]; then
 HUDI_FLINK_BUNDLE_NAME=hudi-flink1.17-bundle
+  elif [[ ${FLINK_PROFILE} == 'flink1.18' ]]; then
+HUDI_FLINK_BUNDLE_NAME=hudi-flink1.18-bundle

Review Comment:
   The `IMAGE_TAG` should be updated to flink 1.18 after we upload the docker 
image. cc @codope for help ~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]

2023-11-07 Thread via GitHub


voonhous commented on PR #9966:
URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800910433

   Done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10009:
URL: https://github.com/apache/hudi/pull/10009#issuecomment-1800908265

   
   ## CI report:
   
   * 8242310dbf950c9ece07c4b4fe5593e70e0bedf4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7041] Optimize the mem usage of partitionToFileGroupsMap during the cleaning [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10002:
URL: https://github.com/apache/hudi/pull/10002#issuecomment-1800908197

   
   ## CI report:
   
   * d10a137bff419d0d4befb5dac8380ac0bf0f12f8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20705)
 
   * 4dcb8f7bea46847202c2444e1a99901484239f4f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7045] fix evolution by using legacy ff for reader [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10007:
URL: https://github.com/apache/hudi/pull/10007#issuecomment-1800900579

   
   ## CI report:
   
   * 93c52c5602738f2d39dd5942d5e5cde940f843f3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20713)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7042] Fix new filegroup reader [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on code in PR #10003:
URL: https://github.com/apache/hudi/pull/10003#discussion_r1385866309


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##
@@ -201,19 +208,65 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
   length,
   shouldUseRecordPosition)
 reader.initRecordIterators()
-
reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala
+// Append partition values to rows and project to output schema
+appendPartitionAndProject(
+  
reader.getClosableIterator.asInstanceOf[java.util.Iterator[InternalRow]].asScala,
+  requiredSchemaWithMandatory,
+  partitionSchema,
+  outputSchema,
+  partitionValues)
+  }
+
+  private def appendPartitionAndProject(iter: Iterator[InternalRow],
+inputSchema: StructType,
+partitionSchema: StructType,
+to: StructType,
+partitionValues: InternalRow): 
Iterator[InternalRow] = {
+if (partitionSchema.isEmpty) {
+  projectSchema(iter, inputSchema, to)

Review Comment:
   Do we still project the rows if the `iter` is already in the required output 
schema?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer

2023-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7048:
-
Labels: pull-request-available  (was: )

> Fix checkpoint loss issue when changing MOR to COW in streamer
> --
>
> Key: HUDI-7048
> URL: https://issues.apache.org/jira/browse/HUDI-7048
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7048] Fix checkpoint loss issue when changing MOR to COW in streamer [hudi]

2023-11-07 Thread via GitHub


danny0405 merged PR #10001:
URL: https://github.com/apache/hudi/pull/10001


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7048.

Resolution: Fixed

Fixed via master branch: eeec775f3803cf231f041196aa11ca1c83228ea8

> Fix checkpoint loss issue when changing MOR to COW in streamer
> --
>
> Key: HUDI-7048
> URL: https://issues.apache.org/jira/browse/HUDI-7048
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated (fe554d89460 -> eeec775f380)

2023-11-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from fe554d89460 [HUDI-7038] RunCompactionProcedure support limit parameter 
(#)
 add eeec775f380 [HUDI-7048] Fix checkpoint loss issue when changing MOR to 
COW in streamer (#10001)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/utilities/streamer/StreamSync.java |  5 +-
 .../deltastreamer/TestHoodieDeltaStreamer.java | 68 ++
 2 files changed, 71 insertions(+), 2 deletions(-)



[jira] [Created] (HUDI-7048) Fix checkpoint loss issue when changing MOR to COW in streamer

2023-11-07 Thread Danny Chen (Jira)
Danny Chen created HUDI-7048:


 Summary: Fix checkpoint loss issue when changing MOR to COW in 
streamer
 Key: HUDI-7048
 URL: https://issues.apache.org/jira/browse/HUDI-7048
 Project: Apache Hudi
  Issue Type: Improvement
  Components: deltastreamer
Reporter: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7038) RunCompactionProcedure support limit parameter

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7038.

Resolution: Fixed

Fixed via master branch: fe554d894601e20e95b10dca86bbe1ee71df4856

> RunCompactionProcedure support limit parameter
> --
>
> Key: HUDI-7038
> URL: https://issues.apache.org/jira/browse/HUDI-7038
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated (f33459b2ea2 -> fe554d89460)

2023-11-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from f33459b2ea2 [HUDI-7039] PartialUpdateAvroPayload preCombine failed 
need show details (#10000)
 add fe554d89460 [HUDI-7038] RunCompactionProcedure support limit parameter 
(#)

No new revisions were added by this update.

Summary of changes:
 .../procedures/RunCompactionProcedure.scala|  6 ++-
 .../hudi/procedure/TestCompactionProcedure.scala   | 47 ++
 2 files changed, 51 insertions(+), 2 deletions(-)



Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on PR #:
URL: https://github.com/apache/hudi/pull/#issuecomment-1800878801

   @ksmou Can you also update the website about this new param?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7038] RunCompactionProcedure support limit parameter [hudi]

2023-11-07 Thread via GitHub


danny0405 merged PR #:
URL: https://github.com/apache/hudi/pull/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7038) RunCompactionProcedure support limit parameter

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7038:
-
Fix Version/s: 1.0.0

> RunCompactionProcedure support limit parameter
> --
>
> Key: HUDI-7038
> URL: https://issues.apache.org/jira/browse/HUDI-7038
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7017] Prevent full schema evolution from wrongly falling back t… [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on PR #9966:
URL: https://github.com/apache/hudi/pull/9966#issuecomment-1800876294

   @voonhous Can you rebase with the latest master to resolve the test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7039) PartialUpdateAvroPayload preCombine failed need show details

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7039:
-
Fix Version/s: 1.0.0

> PartialUpdateAvroPayload preCombine failed need show details
> 
>
> Key: HUDI-7039
> URL: https://issues.apache.org/jira/browse/HUDI-7039
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, PartialUpdateAvroPayload preCombine does not show details even 
> when it fails



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7039) PartialUpdateAvroPayload preCombine failed need show details

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7039.

Resolution: Fixed

Fixed via master branch: f33459b2ea2ae240b49dcf94d8e7715f57c80c5d

> PartialUpdateAvroPayload preCombine failed need show details
> 
>
> Key: HUDI-7039
> URL: https://issues.apache.org/jira/browse/HUDI-7039
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: xy
>Assignee: xy
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, PartialUpdateAvroPayload preCombine does not show details even 
> when it fails



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(hudi) branch master updated: [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details (#10000)

2023-11-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f33459b2ea2 [HUDI-7039] PartialUpdateAvroPayload preCombine failed 
need show details (#10000)
f33459b2ea2 is described below

commit f33459b2ea2ae240b49dcf94d8e7715f57c80c5d
Author: xuzifu666 
AuthorDate: Wed Nov 8 09:50:03 2023 +0800

[HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details 
(#10000)

Co-authored-by: xuyu <11161...@vivo.com>
---
 .../java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java  | 5 +
 1 file changed, 5 insertions(+)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
index 27e744c4925..91b66e004e5 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateAvroPayload.java
@@ -29,6 +29,8 @@ import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
 import java.util.List;
@@ -117,6 +119,8 @@ import java.util.Properties;
  */
 public class PartialUpdateAvroPayload extends 
OverwriteNonDefaultsWithLatestAvroPayload {
 
+  private static final Logger LOG = 
LoggerFactory.getLogger(PartialUpdateAvroPayload.class);
+
   public PartialUpdateAvroPayload(GenericRecord record, Comparable 
orderingVal) {
 super(record, orderingVal);
   }
@@ -141,6 +145,7 @@ public class PartialUpdateAvroPayload extends 
OverwriteNonDefaultsWithLatestAvro
 shouldPickOldRecord ? oldValue.orderingVal : this.orderingVal);
   }
 } catch (Exception ex) {
+  LOG.warn("PartialUpdateAvroPayload precombine failed with ", ex);
   return this;
 }
 return this;



Re: [PR] [HUDI-7039] PartialUpdateAvroPayload preCombine failed need show details [hudi]

2023-11-07 Thread via GitHub


danny0405 merged PR #10000:
URL: https://github.com/apache/hudi/pull/10000


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix npe for get internal schema [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on code in PR #9984:
URL: https://github.com/apache/hudi/pull/9984#discussion_r1385855432


##
hudi-common/src/main/java/org/apache/hudi/common/util/InternalSchemaCache.java:
##
@@ -217,7 +217,11 @@ public static InternalSchema 
getInternalSchemaByVersionId(long versionId, String
 }
 InternalSchema fileSchema = InternalSchemaUtils.searchSchema(versionId, 
SerDeHelper.parseSchemas(latestHistorySchema));
 // step3:
-return fileSchema.isEmptySchema() ? 
AvroInternalSchemaConverter.convert(HoodieAvroUtils.addMetadataFields(new 
Schema.Parser().parse(avroSchema))) : fileSchema;
+return fileSchema.isEmptySchema()
+? StringUtils.isNullOrEmpty(avroSchema)
+  ? InternalSchema.getEmptyInternalSchema()

Review Comment:
   Is it because the version upgrade or something? Is the null avro schema 
coming from an old version Hudi table?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6217] Support handing '_hoodie_operation' meta field for Spark snapshot source [hudi]

2023-11-07 Thread via GitHub


danny0405 commented on PR #8721:
URL: https://github.com/apache/hudi/pull/8721#issuecomment-1800869553

   @beyond1920 You can just take it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated (7ce62fc5793 -> 3d8e72a20fe)

2023-11-07 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 7ce62fc5793 [MINOR] Remove rocksdb version from m1 profile (#10006)
 add 3d8e72a20fe [HUDI-7010] Build clustering group reduces redundant 
traversals (#9957)

No new revisions were added by this update.

Summary of changes:
 .../PartitionAwareClusteringPlanStrategy.java  |  5 
 ...TestSparkBuildClusteringGroupsForPartition.java | 30 ++
 2 files changed, 35 insertions(+)



[jira] [Closed] (HUDI-7010) Build clustering group reduces redundant traversals

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-7010.

Resolution: Fixed

Fixed via master branch: 3d8e72a20fe161839815bc8143b277c93b3c93eb

> Build clustering group reduces redundant traversals
> ---
>
> Key: HUDI-7010
> URL: https://issues.apache.org/jira/browse/HUDI-7010
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7010) Build clustering group reduces redundant traversals

2023-11-07 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7010:
-
Fix Version/s: 0.14.1

> Build clustering group reduces redundant traversals
> ---
>
> Key: HUDI-7010
> URL: https://issues.apache.org/jira/browse/HUDI-7010
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: clustering
>Reporter: kwang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7010] Build clustering group reduces redundant traversals [hudi]

2023-11-07 Thread via GitHub


danny0405 merged PR #9957:
URL: https://github.com/apache/hudi/pull/9957


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7047) Fix incremental queries using new file format

2023-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7047:
-
Labels: pull-request-available  (was: )

> Fix incremental queries using new file format
> -
>
> Key: HUDI-7047
> URL: https://issues.apache.org/jira/browse/HUDI-7047
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/hudi/pull/9888] introduced some issues that cause 
> reads to fail in some tests when the new file format is enabled



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7047] fix various issues with incremental queries in new file format [hudi]

2023-11-07 Thread via GitHub


jonvex opened a new pull request, #10009:
URL: https://github.com/apache/hudi/pull/10009

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low, medium, or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7047) Fix incremental queries using new file format

2023-11-07 Thread Jonathan Vexler (Jira)
Jonathan Vexler created HUDI-7047:
-

 Summary: Fix incremental queries using new file format
 Key: HUDI-7047
 URL: https://issues.apache.org/jira/browse/HUDI-7047
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Jonathan Vexler
Assignee: Jonathan Vexler


[https://github.com/apache/hudi/pull/9888] introduced some issues that cause 
reads to fail in some tests when the new file format is enabled



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10008:
URL: https://github.com/apache/hudi/pull/10008#issuecomment-1800857372

   
   ## CI report:
   
   * b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20715)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [MINOR] Fix tests that set precombine to nonexistent field [hudi]

2023-11-07 Thread via GitHub


hudi-bot commented on PR #10008:
URL: https://github.com/apache/hudi/pull/10008#issuecomment-1800851584

   
   ## CI report:
   
   * b781cdd3f8a6ac42ff96eacd7c1ec4c132106dd9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7046) Fix partial merging logic based on projected schema

2023-11-07 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-7046:
---

Assignee: Ethan Guo

> Fix partial merging logic based on projected schema
> ---
>
> Key: HUDI-7046
> URL: https://issues.apache.org/jira/browse/HUDI-7046
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>
> When querying a table where multiple rounds of partial updates have generated 
> multiple log files, the partial merging logic may fail or give wrong results 
> due to issues in the schema handling and merging logic.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
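
Regarding the HUDI-7046 description above, a toy Java illustration of the intended field-level semantics of partial merging across log files; the names are hypothetical, and this is deliberately not Hudi's actual merger, which must also reconcile the reader's projected schema with each log file's write schema:

    import java.util.LinkedHashMap;
    import java.util.Map;

    final class PartialMergeSketch {
      // Later partial updates win per field; fields absent from an update
      // keep their older values.
      static Map<String, Object> merge(Map<String, Object> older, Map<String, Object> newer) {
        Map<String, Object> merged = new LinkedHashMap<>(older);
        merged.putAll(newer);
        return merged;
      }

      public static void main(String[] args) {
        Map<String, Object> base = new LinkedHashMap<>();
        base.put("id", 1); base.put("name", "a"); base.put("price", 10);
        Map<String, Object> round1 = Map.of("price", 12); // first log file updates price only
        Map<String, Object> round2 = Map.of("name", "b"); // second log file updates name only
        System.out.println(merge(merge(base, round1), round2));
        // -> {id=1, name=b, price=12}
      }
    }

If the reader instead merges using only the query's projected schema, a field that one log file updates but the query does not project can be dropped or misaligned, which is consistent with the failure mode described.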

